The many faces of controller-manager and countless api-servers

What do you do when the cluster turns into a pumpkin? We began answering this question in the previous article. Today we continue looking at ways to build large clusters.

dBrain.cloud's first step toward building large clusters was splitting etcd. Next we tackled the controller-manager.

How to avoid wasting time on garbage collection?

The controller-manager has a “controllers” flag. Inside it are many controllers, each responsible for its own objects: ReplicaSets, Pods, etc. At runtime the controller-manager does not scale horizontally. Example: with three controller-manager replicas, they elect a leader among themselves, and in the end only one of them does the work.

What does the controller-manager do? It reconciles the actual state of the cluster with the desired state. The controller-manager constantly walks the cluster and compares the current state with the one recorded in etcd. It inspects all object descriptions in Kubernetes and looks for objects without an owner, that is, objects created dynamically by Kubernetes rather than by users. On one of our high-load projects with more than 100 thousand pods, we found that the controller-manager spent 90 percent of its time on garbage collection. It simply had no time left for its immediate work: checking pod counts and deployment replicas, managing services, recording endpoint addresses, etc.
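The ownership that the garbage collector traverses is recorded in each object's metadata. A minimal illustration of what the collector inspects (the pod and ReplicaSet names below are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-6d4f8bd9c-x7k2p        # hypothetical pod created by a ReplicaSet
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: web-6d4f8bd9c            # the controller that owns this pod
    uid: <uid-of-the-replicaset>   # must match the owner's actual uid
    controller: true
    blockOwnerDeletion: true
```

When the owner referenced here disappears, the garbage collector deletes the dependent object; with 100 thousand pods, walking these reference graphs is what consumed the controller-manager's time.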

Thanks to the “controllers” flag, we split the controller-manager into three groups. It turned out like with etcd: three groups of three controller-managers each, with one leader in each group. As a result, three controller-manager instances work in parallel, each on its own set of tasks.

The controller-manager layout now looks like this: the first instance took over the main workload (ReplicaSets, Services, Deployments, StatefulSets, etc.), the second handles garbage collection, TTL calculation, etc., and the third handles everything else: nodes, PVs, PVCs, certificates, etc.
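A sketch of what such a split could look like in the kube-controller-manager flags. The controller names follow upstream's `--controllers` syntax (`foo` enables a controller, `-foo` disables it, `*` enables all defaults); the exact grouping in dBrain may differ:

```yaml
# Instance group 1: main workload controllers
- --controllers=replicaset,deployment,statefulset,daemonset,endpoint,endpointslice,job,cronjob
# Instance group 2: garbage collection and TTL
- --controllers=garbagecollector,ttl,ttl-after-finished
# Instance group 3: everything else (all defaults minus the two groups above)
- --controllers=*,-replicaset,-deployment,-statefulset,-daemonset,-endpoint,-endpointslice,-job,-cronjob,-garbagecollector,-ttl,-ttl-after-finished
```

Each group also needs its own leader-election lease (e.g. via `--leader-elect-resource-name`) so that the three groups elect three independent leaders rather than one.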

This increased the cluster’s responsiveness to changes. Garbage collection now runs in a separate controller-manager, so the process can continue uninterrupted without affecting the operation of the rest of the cluster.

HPA for load regulation and envoy

In dBrain, as elsewhere, api-servers originally lived in static pods; these are now used only for bootstrap, to bring the cluster up, after which the overlay network and the main api-servers are deployed. We moved their certificates into Kubernetes secrets, so api-servers can now run anywhere in the cluster instead of being pinned to specific nodes. These manipulations unlock the usual Kubernetes benefits: autoscaling and fault tolerance. We have configured HPA (Horizontal Pod Autoscaler) on the api-servers in dBrain, so they scale automatically with utilization. We hope you find our life hack useful 🙂
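A sketch of what such an HPA could look like, assuming the api-servers run as an ordinary Deployment; the deployment name, namespace, and thresholds below are illustrative, not dBrain's actual values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-apiserver          # hypothetical deployment name
  minReplicas: 3                  # keep quorum-friendly baseline
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # scale out when average CPU exceeds 70%
```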

dBrain runs an envoy (L4-L7 proxy) on every host. It proxies both to the static-pod api-servers and to the api-servers in deployments, making the kube api available on every host in the cluster so that kubelets, controller-managers and schedulers can work. And here the eternal question arises: which came first, the chicken or the egg? The cluster must come up, and for that it needs working api-servers; but if the api-servers only run as deployments, the cluster cannot come up. Therefore envoy implements health-check priorities: if the api-servers in the flexible, horizontally scaling deployment are available, all traffic goes to them; otherwise it falls back to the api-servers in static pods.
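Envoy expresses this fallback natively with endpoint priorities: traffic flows to priority 0 endpoints while they pass health checks and fails over to priority 1 when they don't. A simplified cluster fragment illustrating the idea (addresses, service names, and TLS details are omitted or invented for the example):

```yaml
clusters:
- name: kube-apiserver
  type: STRICT_DNS
  load_assignment:
    cluster_name: kube-apiserver
    endpoints:
    - priority: 0                  # preferred: scalable api-servers in a deployment
      lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: apiserver.kube-system.svc, port_value: 6443 }
    - priority: 1                  # fallback: api-servers in static pods
      lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: 10.0.0.1, port_value: 6443 }
  health_checks:
  - timeout: 2s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 1
    http_health_check: { path: /readyz }   # kube-apiserver readiness endpoint
```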

In earlier versions of the platform, Kubernetes system components reached the api-servers in static pods through envoy, and worked with those api-servers exclusively. The api-servers in deployments handled requests only from within the cluster (via the kubernetes.default.svc service); users working with the cluster also connected to them.

As for the scheduler, unlike the controller-manager, it does not need to be scaled. A single scheduler instance does not experience constant load and works in an event-driven manner. For example, on large clusters it does not scan every host trying to schedule a pod somewhere. By default it takes roughly 10 percent of the hosts and tries to schedule pods on them; if that fails, it takes the next 10 percent, and so on.
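The fraction of nodes the scheduler evaluates is controlled by `percentageOfNodesToScore` in the scheduler configuration (when unset, Kubernetes picks an adaptive default that shrinks as the cluster grows). Pinning it explicitly would look like this:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Score only ~10% of feasible nodes per scheduling attempt,
# then place the pod on the best-scoring one found.
percentageOfNodesToScore: 10
```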

To summarize: by splitting etcd and the controller-manager, we can work with large clusters as quickly as with small ones. These two changes eliminated problems that, as the cluster grew, led to degraded operation and other side effects.

A cluster of any size is now much more responsive: its parts no longer interfere with one another. When administering a cluster via kubectl, nothing lags, regardless of cluster size. Services tied to Kubernetes’ own resources now respond instantly. We are heading for 200 thousand pods per cluster, so stay tuned 😉

If you found our article useful, please share it with your colleagues or friends – engaging and reading the materials fills us with enthusiasm to find and uncover new interesting topics for you.
