How we built our mini-k8s in Go with the Helm template engine

Solving large, complex problems demands large resources – that dependency is hard to escape. At the same time, trying to “eat the whole elephant” in one go is rarely justified: it is usually far more rational to break a complex task into a set of atomic steps and work through them gradually.

My name is Stanislav Ivankevich. I am a senior developer on the DataMasters team at VK Tech. In this article, I'll discuss how we used decomposition to develop our mini-k8s, which automates the creation and maintenance of applications in customer Kubernetes clusters.

Context: environment

Let's start with the environment in which we live and work – this is important for understanding why we make certain decisions.

We work in the cloud. In our case, all cloud clients can be divided into two large groups:

  • Cloud users — people who connect to the cloud and use the services or capacity they need.

  • Users of a specific service — other services that connect to it, for example, as part of a multi-component application architecture, when one tool requires another to work.

At the same time, regardless of who exactly the user is in a particular scenario, the cloud must always perform very specific tasks:

  • Create what the user asks for on the user's own resources: allocate capacity, provide a tool, deploy a working environment, and so on.

  • Give the user access (sometimes even full) to the created resources.

  • Maintain the functionality of created resources.

The caveat is that maintaining such a system can be problematic. Moreover, there are three risk zones at once.

  • Hardware. Physical equipment is neither eternal nor predictable – even with maintenance and regular renewal of the hardware fleet, at any moment, in any data center, anything can fail.

  • Neighboring Services. Services do not live in a vacuum; they often depend on each other. And the more extensive the service’s functionality, the more dependencies it has. That is, a failure at the level of just one service can lead to degradation at the level of the entire cloud.

  • User. The user decides how to use the allocated resources. This creates the risk that client actions (delete, change, create) will unknowingly trigger failures.

To minimize these risks while still solving user problems, the cloud is strictly divided into two layers: the data plane and the control plane.

  • Control plane (the control layer). It includes the components that keep the entire cloud functioning: services, APIs, scripts, and so on. The user can access them only through the public API.

  • Data plane (the user data layer). These are the components that a cloud user owns and uses: virtual machines, Kubernetes clusters, databases, etc. The user can manage these resources at their own discretion.

In this division, it is important that there are no control structures in the data plane (with rare exceptions, such as monitoring agents), and that data plane resources are created and managed via the control plane.

From context to task

Based on the requirement to separate layers into data plane and control plane, we must perform any operations as a cloud provider through the control plane.

At the same time, at some point we faced the need to automate the creation and support of applications in customer Kubernetes clusters. The first such application was Apache Kafka.

In theory, this is not difficult to implement – Kubernetes has an operator mechanism, so it is enough to install an operator in the user's K8s cluster, through which Kafka can then be created and managed. The nuance is that the K8s cluster lives in the data plane, meaning the user can potentially delete the operator or even the entire cluster. This is obviously unacceptable, which is why we, as a cloud provider, need special control over that operator.

The situation is similar with disks, networks, storage, access, and other components – in theory, all these processes can be automated through an operator in the customer's Kubernetes cluster. But that operator sits in the data plane, so we need an additional “guarantor of stability”: a tool that provides full control from the control plane.

So we arrived at the need to create our own mini-k8s for managing Kubernetes and everything around it.

Architecture

When building our implementation of mini-k8s, we were inspired by individual features and capabilities built into the foundation of Kubernetes. In particular, we borrowed declarative management through manifests and templating through helm, which we embedded as a library in our service. This combination would potentially let us completely solve the problem of the unpredictable data plane.

When choosing an architecture, we considered several options and ultimately settled on a scheme of three components:

  • Gate. The service's entry point, through which all requests arrive – for example, to create a cluster or install an application. This is where sets of manifests are generated from our helm charts based on user requests.

  • Storage. An abstraction over a database that works with manifests. In our case, we moved it into a separate microservice. Manifests generated in the gate arrive in storage, and this is where they are stored.

  • Controllers. The main workhorses of the entire system, similar to the controllers inside Kubernetes operators. Controllers receive manifests from storage, process them, and bring the data plane to the state specified in the manifests.

Concept

When defining the general concept of implementation, we proceeded from the fact that we can decompose any large and complex task into a number of simpler and smaller ones. At the same time, it is important that each task is formulated in a declarative format – it describes not what and how to do, but what should happen in the end.

The main condition is that each task can be completed independently of the others.

At the same time, we admit that some tasks may depend on the results of other tasks. The result is a kind of execution tree.

When we projected this decomposition algorithm onto the task of creating Kafka on Kubernetes, we arrived at the following diagram.

For example, the topmost block of the diagram is “Network and subnet”, on which the “k8s Cluster” block depends – this hierarchy perfectly demonstrates that before creating a Kubernetes cluster, you need to create a network and subnet. In this case, for example, load balancers or IPs can be created in parallel and independently.
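The readiness rule behind this tree can be sketched in a few lines of Go – a toy illustration with made-up task names mirroring the diagram, not our service code: a task may start only when every task it depends on is already done.

```go
package main

import "fmt"

// deps maps each task to the tasks it depends on (hypothetical names:
// a cluster needs the network, Kafka needs the cluster, balancers are free).
var deps = map[string][]string{
	"network":      {},
	"k8sCluster":   {"network"},
	"loadBalancer": {},
	"kafka":        {"k8sCluster"},
}

// ready reports whether every dependency of task is already done,
// i.e. whether the task may be executed now (possibly in parallel
// with other ready tasks).
func ready(task string, done map[string]bool) bool {
	for _, d := range deps[task] {
		if !done[d] {
			return false
		}
	}
	return true
}

func main() {
	done := map[string]bool{"network": true}
	fmt.Println(ready("k8sCluster", done)) // the network exists: true
	fmt.Println(ready("kafka", done))      // the cluster does not yet: false
}
```

The same check is what lets independent branches of the tree (such as load balancers and IPs) run in parallel.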

Manifests and controllers

The two main entities of our entire system are manifests and controllers.

  • manifests contain a description of how things should be;

  • controllers know how to get from almost any situation to the state described in the manifest.

This is essentially how operators work in Kubernetes itself.

At the same time, in our system:

  • manifests are similar to CRDs in k8s, but are processed not by operators in the customer's Kubernetes, but by our service in the control plane;

  • one controller processes strictly one type of manifest.

But there are exceptions. For example, the apply controller simply applies a k8s manifest to the user's cluster. In this case, the spec of our apply manifest contains the required k8s manifest as-is – for example, to create a namespace or install an ingress.

The next important step is coordinating the work of manifests and controllers. Kubernetes operators have a reconciliation loop for this – a coordination cycle performed by the controller inside the operator.

We have implemented something similar in our system – task puller.

It is responsible for two tasks:

  • Decides when to hand the task of processing a manifest to the right controller – for example, when the contents of the manifest change, or when the entity created from that manifest changes.

  • Sends manifests to controllers for processing at a certain frequency, even if no events occurred on them. Just in case.

Having implemented this decomposition, we get a set of yaml manifests, each performing a small part of the overall work.

By collecting these manifests together and templating them, we get a new application packaged as a helm chart.

Code

Now let's move from abstract things to applied ones and look at what is “under the hood” of each component of our system at the code level.

Storage

For storage, we implemented a simple gRPC API with two methods – Apply and List.

Initially, we considered an implementation that would accept and return manifests without transformation – as yaml. But then working with the API would be greatly complicated by unstructured requests, so we abandoned the idea.

Instead, we identified the core mandatory elements of the manifest and spelled them out in the contract. Only the contents of the spec field remain relatively free-form – its structure is unique to each type of manifest.

service Storage {
 rpc Apply(ApplyReq) returns (ApplyRes) {}
 rpc List(ListReq) returns (ListRes) {}
}

message ApplyReq {
 repeated Manifest manifests = 1;
}

message ListReq {
 uint64 offset = 1;
 uint64 limit = 2;
 repeated string groupIds = 3;
 repeated string ids = 4;
}

message Manifest {
 string apiVersion = 1;
 string kind = 2;
 Metadata metadata = 3;
 google.protobuf.Struct spec = 4;
 Status status = 5;
}

It is noteworthy that the structure of our manifests is identical to that of Kubernetes manifests. In addition, they map directly onto the gRPC structures. Our manifests include the following fields:

  • apiVersion — handler version;

  • kind — handler name;

  • metadata – metadata: identifiers and labels;

  • spec – specification, basic information required by the controller to operate;

  • status — target state and current status.

Moreover, metadata has two required fields:

  • uid — identifier of the manifest itself;

  • guid — identifier of the manifest group (a group comprises the manifests from one helm chart).

apiVersion: magnum.vkcs.cloud/v0
kind: Cluster
metadata:
 uid: "{{ .Values.clusterID }}"
 guid: "{{ .Values.groupID }}"
 dataPlatform:
   projectID: "{{ .Values.projectID }}"
 name: "cluster"
spec:
 ...
status:
 goal: exists

message Metadata {
 string uid = 1;
 string guid = 2;
 map<string, string> labels = 3;
 string name = 5;
}

message Manifest {
 string apiVersion = 1;
 string kind = 2;
 Metadata metadata = 3;
 google.protobuf.Struct spec = 4;
 Status status = 5;
}

Controller

Now about the controllers.

There can be many controllers – for different checks, conditions and actions. But they all have the same basic structure.

In fact, a controller only needs to implement three interface methods:

  • HandleExist() – called if, according to the manifest, it is necessary to carry out operations to bring the system to the desired state.

  • HandleDelete() – called when you need to delete and, if possible, “clean up after yourself.”

  • Kinds() – returns the list of manifest kinds that this controller processes (in most cases there is only one kind).

type Controller interface {
  HandleExist(context.Context, Req) (*Res, error)
  HandleDelete(context.Context, Req) (*Res, error)
  Kinds() []string
}

Both handle methods of the controller (HandleExist and HandleDelete) receive a request as input and must return a response.

  • Request is a small interface exposing the part of the manifest that matters to the controller. The spec is represented here as any (Spec() any).

  • Response is a plain structure with the set of fields the controller may change in the manifest – for example, the status and the time of the next forced run.

type Controller interface {
  HandleExist(context.Context, Req) (*Res, error)
  HandleDelete(context.Context, Req) (*Res, error)
  Kinds() []string
}

// ---

type Req interface {
  Spec() any
  Meta() domain.Meta
}

type Res struct {
  Status  string
  StartUp *time.Time
}

The implementation of Req is the manifest itself – the request interface simply limits the range of data available to the controller.

type Req interface {
  Spec() any
  Meta() domain.Meta
}

type Res struct {
  Status  string
  StartUp *time.Time
}

// ---

type Manifest struct {
  spec   spec
  meta   Meta
  status Status
}

func (m Manifest) Kind() string {
  return m.spec.Kind()
}

type spec interface {
  Kind() string
}
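To make that narrowing concrete, here is a minimal self-contained sketch (with simplified, illustrative types – not the real domain package) of a manifest satisfying such a request interface:

```go
package main

import "fmt"

// Meta is a stand-in for domain.Meta.
type Meta struct{ UID string }

// Req limits what a controller can see of the manifest.
type Req interface {
	Spec() any
	Meta() Meta
}

// Manifest itself implements Req: the interface hides status and
// everything else the controller has no business touching.
type Manifest struct {
	spec any
	meta Meta
}

func (m Manifest) Spec() any  { return m.spec }
func (m Manifest) Meta() Meta { return m.meta }

func main() {
	var r Req = Manifest{spec: "ns-spec", meta: Meta{UID: "42"}}
	fmt.Println(r.Meta().UID) // prints "42"
}
```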

It is noteworthy that the kind of the manifest is taken from the spec. This is because all manifests are structurally identical, and the differences are almost always in the spec.

One of the simplest specs is the namespace manifest spec.

type Manifest struct {
  spec   spec
  meta   domain.Meta
  status *domain.Status
}

func (m Manifest) Kind() string {
  return m.spec.Kind()
}

type spec interface {
  Kind() string
}

// ---

type Namespace struct {
  kind[Namespace]
  Namespace string `json:"namespace"`
}

type kind[T any] struct{}

func (m kind[T]) Kind() string {
  return reflect.TypeOf(*new(T)).Name()
}

It contains just one field, namespace, and embeds a simple generic kind[T any] structure. kind[T any] is a small hack that lets us avoid manually writing a name for each spec.
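Here is the same trick as a runnable, self-contained snippet: embedding kind[Namespace] gives the spec its Kind() with no hand-written name, because reflection recovers the type's name.

```go
package main

import (
	"fmt"
	"reflect"
)

// kind derives the manifest kind from the spec's type name via reflection,
// so each spec does not have to spell out its own name.
type kind[T any] struct{}

func (m kind[T]) Kind() string {
	return reflect.TypeOf(*new(T)).Name()
}

// Namespace embeds kind[Namespace] and gets Kind() == "Namespace" for free.
type Namespace struct {
	kind[Namespace]
	Namespace string `json:"namespace"`
}

func main() {
	fmt.Println(Namespace{}.Kind()) // prints "Namespace"
}
```

Adding a new spec is then just declaring a struct that embeds kind of itself.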

Especially interesting here is the moment of converting a gRPC struct into a concrete spec structure: it is a sequential conversion of the gRPC structure into json, and of the json into the desired spec structure (pb -> json -> spec).

func specFactory[T Spec](str *structpb.Struct) Spec {
  s := new(T)
  j, err := str.MarshalJSON()
  if err != nil {
    return nil
  }
  if err := json.Unmarshal(j, s); err != nil {
    return nil
  }
  return *s
}

Admittedly, this is a slow but universal operation. We have a mechanism for switching to “fast” conversions when necessary, but here that would be unjustified.
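To show the shape of the pb -> json -> spec chain without pulling in protobuf, here is a stdlib-only analogue: a generic map stands in for *structpb.Struct, and the concrete spec type is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NamespaceSpec stands in for a concrete spec type.
type NamespaceSpec struct {
	Namespace string `json:"namespace"`
}

// specFromStruct mimics the pb -> json -> spec chain: marshal the generic
// structure to JSON, then unmarshal the JSON into the typed spec.
func specFromStruct(s map[string]any) (NamespaceSpec, error) {
	var out NamespaceSpec
	j, err := json.Marshal(s)
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(j, &out)
	return out, err
}

func main() {
	spec, err := specFromStruct(map[string]any{"namespace": "kafka"})
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Namespace) // prints "kafka"
}
```

The double serialization is exactly what makes the real operation slow but universal: it works for any spec type without hand-written conversion code.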

For clarity, here is what a small controller responsible for a Kubernetes namespace might look like.

func (n Namespace) HandleExist(
  ctx context.Context,
  req Req,
) (*Res, error) {
  s, ok := req.Spec().(spec.Namespace)
  if !ok {
    return nil, fmt.Errorf("invalid manifest spec")
  }

  _, err := n.k8sClient.NSGet(ctx, s.Namespace)
  if err == nil {
    // the namespace already exists, nothing to do
    return &Res{Status: domain.StatusSuccess}, nil
  }
  if !k8sErrors.IsNotFound(err) {
    return nil, fmt.Errorf("get namespace, %w", err)
  }

  err = n.k8sClient.NSCreate(ctx, s.Namespace)
  if err != nil {
    return nil, fmt.Errorf("create namespace, %w", err)
  }

  return &Res{
    Status: domain.StatusSuccess,
  }, nil
}

Everything here is relatively simple:

  • we get the manifest spec and immediately cast it to the desired type;

  • next, we call kube-api to check whether the requested namespace exists;

  • if it exists, we do nothing; if it does not exist and the request did not return an error, we try to create it;

  • on successful creation, we mark the manifest as successfully processed;

  • the worker then picks up the changes the controller made to the status and saves them to the database.

Gate

Now let's talk more about gate and how we use helm.

Let's start with a short excursion into helm.

So, helm uses the famous Cobra framework.

There are many helm commands, but in our case only one matters – helm install. For example:

helm install panda bitnami/wordpress

It runs the installation and uploads the manifests to the cluster. The nuance is that we need a slightly different scenario, so we add the dry-run and debug flags to the command:

helm install panda bitnami/wordpress --dry-run --debug

In this case, instead of applying the manifests, helm will print their content to the console.

At the Cobra framework level, the code with our helm install command looks something like this:

cmd := &cobra.Command{
  Use:   "install [NAME] [CHART]",
  Short: "install a chart",
  Long:  installDesc,
  Args:  require.MinimumNArgs(1),
  ValidArgsFunction: func(cmd *cobra.Command, args []string, toComplete string) ([]string, cobra.ShellCompDirective) {
    return compInstall(args, toComplete, client)
  },
  RunE: func(_ *cobra.Command, args []string) error {
    ...
    rel, err := runInstall(args, client, valueOpts, out)
    if err != nil {
      return errors.Wrap(err, "INSTALLATION FAILED")
    }
    ...
  },
}

The most important thing here is the runInstall function – this is where the main “magic” of the installation happens. Let's look at it in more detail.

func runInstall(
  args []string,
  client *action.Install,
  valueOpts *values.Options,
  out io.Writer,
) (*release.Release, error) {
  ...
  vals, err := valueOpts.MergeValues(p)
  ...
  chartRequested, err := loader.Load(cp)
  ...
  return client.RunWithContext(ctx, chartRequested, vals)
}

The example shows that the main logic is concentrated in the action client – the loaded chart and values must be passed to its RunWithContext method.


This client lives in pkg (helm.sh/helm/v3/pkg/action), which makes it easy for us to use.

Usage example

Now we can move from theory and general overview to a usage example.

1. First, import the necessary helm packages. For basic use cases you need only two: action and chart, which, unsurprisingly, contains the chart structure.

import (
  "helm.sh/helm/v3/pkg/action"
  "helm.sh/helm/v3/pkg/chart"
)

2. Next, create an instance of the install structure. If you only need to generate manifests, be sure to set DryRun and ClientOnly to true. That said, you can also drive Kubernetes itself through the helm installer from your application – in that case, working with kube-api becomes simpler.

At the same stage, be sure to set the release name (installation will not start without it) and the namespace.

client := action.NewInstall(&action.Configuration{})
client.DryRun = true
client.ClientOnly = true
client.ReleaseName = "release-name"
client.Namespace = "namespace"

The release values can be accessed in templates. Note that release values are often used in helm charts, so it is better not to neglect them.

3. Next, we load the chart files and get a chart instance. The method depends on the specific case: if the charts are available next to the binary, you can use the loader that helm itself uses. If not, you will need your own implementation of the loader interface – in that case the files have to be passed in from code rather than just specifying a path.

chartReq, err := loader.Load("helmName")
if err != nil {
  log.Fatal(err)
}

...

type Loader interface {
  Load(chartName string) (*chart.Chart, error)
}

4. With the installer client and the chart instance prepared, you can start the installation. Note that you may pass nothing as the values – in that case the values from the chart's values file are used. If you pass a valueMap, those values are overridden.

rel, err := client.RunWithContext(ctx, chartReq, valueMap)
if err != nil {
  log.Fatal(err)
}

At this point we already have the manifests we need – they are in rel.Manifest.

5. Next, we convert a set of yaml manifests into grpc structures and send them to storage.

manifests, err := manifestsFromYAML(rel.Manifest)
if err != nil {
  log.Fatal(err)
}

...

cc := NewClientConnect()
storageClient := storage.NewStorageClient(cc)

req := storage.ApplyReq{Manifests: manifests}
_, err = storageClient.Apply(ctx, &req)
if err != nil {
  log.Fatal(err)
}

A separate note on the method that converts the set of yaml manifests into a slice of gRPC manifests for sending to storage. The simplified algorithm is as follows:

  • first, the one large yaml file is split into a slice of individual yaml manifests;

  • then, in a loop, each one is converted to json;

  • next, since the yaml manifest structure must be identical to the gRPC structure, the resulting json is unmarshaled into the proto structure;

  • everything is collected into one big slice and returned.

type ProtoMessage[T any] interface {
  protoreflect.ProtoMessage
  *T
}

func ManifestsFromYAML[T any, S ProtoMessage[T]](manifestYAML string) ([]S, error) {
  yamls := strings.Split(manifestYAML, "\n---\n")
  manifests := make([]S, 0, len(yamls))
  for _, yamlStr := range yamls {
    if yamlStr == "" {
      continue
    }
    jsonBytes, err := yaml.YAMLToJSON([]byte(yamlStr))
    if err != nil {...}

    var man T
    if err := protojson.Unmarshal(jsonBytes, S(&man)); err != nil {...}

    manifests = append(manifests, &man)
  }
  return manifests, nil
}

Pitfalls along the way

Like any other solution, ours had its share of non-obvious aspects, and for various reasons we have already stepped on a few of these “rakes”.

  • Over-reliance on choreography. We relied too heavily on choreography and full controller independence. The main problem is that checks often had to be duplicated. For example, if the manifest that creates a cluster and the manifest that installs an application on that cluster are completely independent, both must check that the cluster exists and is healthy. That is at least two requests instead of one. Multiply that by a couple of dozen manifests, then by another hundred services we deploy – and repeat every minute. The result is self-DDoS.

To solve this, we added dependencies to the manifests: a controller accepts a manifest only when all the manifests it depends on have the “successful” status. This left only the relevant checks in the controllers.

  • Explosive chart growth as functionality is added. Application charts tend to grow with functionality, making them increasingly hard to maintain. We decided to make smaller, more specialized charts – for example, a chart for creating a cluster, or a chart for installing a specific application such as Kafka. This simplified the charts and added another layer of design – composition out of charts.

  • Races between user requests. We did not account for races among user requests. Because of this, situations could arise where a user managed to add an application to a cluster that he had just sent for deletion.

We fixed this by saving the application being created on the gate side and by introducing state machines that explicitly limit the actions available for applications in certain states.
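A minimal sketch of such a gate-side state machine (state and action names are illustrative, not our real ones) might look like this:

```go
package main

import "fmt"

// allowed lists the user actions permitted in each application state.
// A request for an action not listed for the current state is rejected,
// which is what closes the race described above.
var allowed = map[string]map[string]bool{
	"ready":    {"addApp": true, "delete": true},
	"deleting": {}, // nothing is allowed once deletion has started
}

// can reports whether the action is permitted in the given state.
func can(state, action string) bool {
	return allowed[state][action]
}

func main() {
	fmt.Println(can("ready", "addApp"))    // true
	fmt.Println(can("deleting", "addApp")) // false: the racing request is rejected
}
```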

Results and some thoughts from our experience

Trying to “eat the whole elephant” is not a good idea, so we went a different route from the start:

  • we took a complex cloud task and broke it down into a set of atomic steps;

  • for each step we created our own helm-style manifest;

  • for each manifest we wrote our own handler.

The result is a set of small, versatile blocks that, like Lego bricks, can be assembled into large applications. This gave us all the benefits of helm, and along the way we were able to reuse publicly available tooling and principles.

Based on the developed service, we formed a layer of business logic on finite state machines to manage applications. This layer allows developers to abstract away from working with the infrastructure and only change application parameters. This facilitates the process of introducing new applications, as well as organizing interaction between them. We will talk about this in a separate article.

Of course, this approach did not save us from every difficulty or eliminate the occasional pitfall, but it definitely helped us automate the creation and deployment of applications in customer Kubernetes clusters with less effort and without elaborate “crutches”. Meanwhile, we continue to develop and improve the product, and we will cover the results of those improvements in future articles.
