Kubernetes: why is it so important to set up system resource management?

As a rule, any application needs a dedicated pool of resources for correct and stable operation. But what if several applications run on the same capacity? How do you provide each of them with the minimum necessary resources? How do you limit resource consumption? How do you distribute the load between nodes correctly? How do you make sure the horizontal scaling mechanism works when the load on the application grows?

Let's start with the basic types of resources that exist in the system – these are, of course, CPU time and RAM. In k8s manifests these resources are measured in the following units:

  • CPU – in cores (whole cores or fractions of a core, e.g. 500m = 0.5 of a core)
  • RAM – in bytes (with suffixes such as Mi or Gi)

Moreover, for each resource it is possible to set two types of requirements – requests and limits. Requests describes the minimum amount of free resources a node must have to run the container (and the pod as a whole), while limits sets a hard cap on the resources available to the container.

It is important to understand that the manifest does not have to define both types explicitly, and the behavior will be as follows:

  • If only limits are explicitly set for a resource, then requests for this resource automatically take a value equal to limits (this can be verified by calling describe on the entity). In other words, the container's operation will in effect be limited by the same amount of resources that it requires to run.
  • If only requests are explicitly set for a resource, then no upper restriction is placed on this resource – i.e. the container is limited only by the resources of the node itself.

It is also possible to configure resource management not only at the level of a specific container, but also at the namespace level using the following entities:

  • LimitRange – describes the restriction policy at the container / pod level within a namespace; it is used to set default restrictions for containers / pods, to prevent the creation of obviously oversized containers / pods (or, conversely, tiny ones), to limit their number, and to define the allowed difference between the values of limits and requests (an example of the latter is given after the manifests below)
  • ResourceQuota – describes the restriction policy for all containers in a namespace as a whole and is used, as a rule, to divide resources between environments (useful when environments are not strictly separated at the node level)

The following are examples of manifests where resource limits are set:

  • At the specific container level:

    containers:
    - name: app-nginx
      image: nginx
      resources:
        requests:
          memory: 1Gi
        limits:
          cpu: 200m

    That is, in this case, to launch the nginx container the node must have at least 1Gi of free RAM and 0.2 CPU (since requests.cpu automatically takes the value of limits.cpu), while the container can consume at most 0.2 CPU and all the RAM available on the node.

  • At the level of the whole namespace:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: nxs-test
    spec:
      hard:
        requests.cpu: 300m
        requests.memory: 1Gi
        limits.cpu: 700m
        limits.memory: 2Gi

    That is, the sum of all container requests in the default namespace cannot exceed 300m of CPU and 1Gi of RAM, and the sum of all limits cannot exceed 700m of CPU and 2Gi of RAM.

  • Default restrictions for containers in a namespace:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: nxs-limit-per-container
    spec:
      limits:
      - type: Container
        defaultRequest:
          cpu: 100m
          memory: 1Gi
        default:
          cpu: 1
          memory: 2Gi
        min:
          cpu: 50m
          memory: 500Mi
        max:
          cpu: 2
          memory: 4Gi

    That is, in the default namespace every container will, by default, have its request set to 100m of CPU and 1Gi of RAM and its limit set to 1 CPU and 2Gi. In addition, the allowed request / limit values are restricted for CPU (from 50m to 2 cores) and for RAM (from 500Mi to 4Gi).

  • Restrictions at the pod level in a namespace:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: nxs-limit-pod
    spec:
      limits:
      - type: Pod
        max:
          cpu: 4
          memory: 1Gi

    That is, for each pod in the default namespace a limit of 4 CPU and 1Gi of RAM will be set (across all of the pod's containers in total).
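
  • Restricting the allowed ratio between limits and requests in a namespace (a hypothetical sketch – the name and ratio values are illustrative, not taken from a real cluster):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: nxs-limit-ratio
    spec:
      limits:
      - type: Container
        maxLimitRequestRatio:
          cpu: 4
          memory: 2

    That is, in such a namespace a container whose CPU limit is more than 4 times its CPU request (or whose memory limit is more than 2 times its memory request) will be rejected at creation time.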

Now let's look at what advantages setting these restrictions can give us.

The mechanism of load balancing between nodes

As is known, k8s has a component called the scheduler, which works according to a specific algorithm. In the process of choosing the optimal node on which to launch a pod, this algorithm goes through two stages:

  1. Filtering
  2. Scoring (ranking)

That is, according to the described policy, the nodes on which the pod can be launched are first selected based on a set of predicates (including a check that the node has enough resources to run the pod – PodFitsResources), and then each of these nodes is scored according to priorities (among other things, the more free resources a node has, the more points it receives – LeastResourceAllocation / LeastRequestedPriority / BalancedResourceAllocation); the pod is launched on the node with the highest number of points (if several nodes satisfy this condition at once, a random one is chosen).

At the same time, you need to understand that when evaluating a node's available resources, the scheduler relies on the data stored in etcd – i.e. on the requested / limit resources of each pod running on this node, not on actual resource consumption. This information can be obtained from the output of kubectl describe node $NODE, for example:

# kubectl describe nodes nxs-k8s-s1
..
Non-terminated Pods:         (9 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                         ------------  ----------  ---------------  -------------  ---
  ingress-nginx              nginx-ingress-controller-754b85bf44-qkt2t    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                kube-flannel-26bl4                           150m (0%)     300m (1%)   64M (0%)         500M (1%)      233d
  kube-system                kube-proxy-exporter-cb629                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                kube-proxy-x9fsc                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                nginx-proxy-k8s-worker-s1                    25m (0%)      300m (1%)   32M (0%)         512M (1%)      233d
  nxs-monitoring             alertmanager-main-1                          100m (0%)     100m (0%)   425Mi (1%)       25Mi (0%)      233d
  nxs-logging                filebeat-lmsmp                               100m (0%)     0 (0%)      100Mi (0%)       200Mi (0%)     233d
  nxs-monitoring             node-exporter-v4gdq                          112m (0%)     122m (0%)   200Mi (0%)       220Mi (0%)     233d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                487m (3%)          822m (5%)
  memory             15856217600 (2%)  749976320 (3%)
  ephemeral-storage  0 (0%)             0 (0%)

Here we see all the pods running on a particular node, as well as the resources that each of them requests. And here is what the scheduler log looks like when the cronjob-cron-events-1573793820-xt6q9 pod is scheduled (this information appears in the scheduler log when logging verbosity level 10 is set in its launch arguments, --v=10):


I1115 07:57:21.637791       1 scheduling_queue.go:908] About to try and schedule pod nxs-stage/cronjob-cron-events-1573793820-xt6q9                                                                                                                                           
I1115 07:57:21.637804       1 scheduler.go:453] Attempting to schedule pod: nxs-stage/cronjob-cron-events-1573793820-xt6q9                                                                                                                                                    
I1115 07:57:21.638285       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s5 is allowed, Node is running only 16 out of 110 Pods.                                                                               
I1115 07:57:21.638300       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s6 is allowed, Node is running only 20 out of 110 Pods.                                                                               
I1115 07:57:21.638322       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s3 is allowed, Node is running only 20 out of 110 Pods.                                                                               
I1115 07:57:21.638322       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s4 is allowed, Node is running only 17 out of 110 Pods.                                                                               
I1115 07:57:21.638334       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, Node is running only 16 out of 110 Pods.                                                                              
I1115 07:57:21.638365       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s12 is allowed, Node is running only 9 out of 110 Pods.                                                                               
I1115 07:57:21.638334       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s11 is allowed, Node is running only 11 out of 110 Pods.                                                                              
I1115 07:57:21.638385       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s1 is allowed, Node is running only 19 out of 110 Pods.                                                                               
I1115 07:57:21.638402       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s2 is allowed, Node is running only 21 out of 110 Pods.                                                                               
I1115 07:57:21.638383       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, Node is running only 16 out of 110 Pods.                                                                               
I1115 07:57:21.638335       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, Node is running only 18 out of 110 Pods.                                                                               
I1115 07:57:21.638408       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s13 is allowed, Node is running only 8 out of 110 Pods.                                                                               
I1115 07:57:21.638478       1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, existing pods anti-affinity terms satisfied.                                                                         
I1115 07:57:21.638505       1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, existing pods anti-affinity terms satisfied.                                                                          
I1115 07:57:21.638577       1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, existing pods anti-affinity terms satisfied.                                                                          
I1115 07:57:21.638583       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s7 is allowed, Node is running only 25 out of 110 Pods.                                                                               
I1115 07:57:21.638932       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 9        
I1115 07:57:21.638946       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 8           
I1115 07:57:21.638961       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: BalancedResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 9        
I1115 07:57:21.638971       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7        
I1115 07:57:21.638975       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: LeastResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 8           
I1115 07:57:21.638990       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7           
I1115 07:57:21.639022       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: TaintTolerationPriority, Score: (10)                                                                                                        
I1115 07:57:21.639030       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: TaintTolerationPriority, Score: (10)                                                                                                         
I1115 07:57:21.639034       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: TaintTolerationPriority, Score: (10)                                                                                                         
I1115 07:57:21.639041       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: NodeAffinityPriority, Score: (0)                                                                                                            
I1115 07:57:21.639053       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: NodeAffinityPriority, Score: (0)                                                                                                             
I1115 07:57:21.639059       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: NodeAffinityPriority, Score: (0)                                                                                                             
I1115 07:57:21.639061       1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: InterPodAffinityPriority, Score: (0)                                                                                                                   
I1115 07:57:21.639063       1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)                                                                                                                   
I1115 07:57:21.639073       1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: InterPodAffinityPriority, Score: (0)                                                                                                                    
I1115 07:57:21.639077       1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)                                                                                                                    
I1115 07:57:21.639085       1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: InterPodAffinityPriority, Score: (0)                                                                                                                    
I1115 07:57:21.639088       1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)                                                                                                                    
I1115 07:57:21.639103       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)                                                                                                         
I1115 07:57:21.639109       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)                                                                                                          
I1115 07:57:21.639114       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)                                                                                                          
I1115 07:57:21.639127       1 generic_scheduler.go:781] Host nxs-k8s-s10 => Score 100037                                                                                                                                                                            
I1115 07:57:21.639150       1 generic_scheduler.go:781] Host nxs-k8s-s8 => Score 100034                                                                                                                                                                             
I1115 07:57:21.639154       1 generic_scheduler.go:781] Host nxs-k8s-s9 => Score 100037                                                                                                                                                                             
I1115 07:57:21.639267       1 scheduler_binder.go:269] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10"                                                                                                               
I1115 07:57:21.639286       1 scheduler_binder.go:279] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10": all PVCs bound and nothing to do                                                                             
I1115 07:57:21.639333       1 factory.go:733] Attempting to bind cronjob-cron-events-1573793820-xt6q9 to nxs-k8s-s10

Here we see that the scheduler initially performs filtering and forms a list of 3 nodes on which the pod can be launched (nxs-k8s-s8, nxs-k8s-s9, nxs-k8s-s10). Then it calculates scores based on several parameters (including BalancedResourceAllocation and LeastResourceAllocation) for each of these nodes in order to determine the most suitable one. In the end the pod is scheduled on the node with the highest number of points (here two nodes share the highest score of 100037, so a random one of them is chosen – nxs-k8s-s10).

Conclusion: if pods run on a node without any restrictions set for them, then for k8s (from the point of view of resource consumption) it is as if those pods did not exist on that node at all. Therefore, if you have a pod with a voracious process (for example, wowza) and no restrictions are set for it, a situation may arise where that pod has actually eaten all of the node's resources, but for k8s the node is still considered unloaded and during ranking (specifically, in the scores that evaluate available resources) it receives as many points as a node that has no working pods at all, which can ultimately lead to an uneven distribution of load between the nodes.

Pod eviction

As you know, each pod is assigned one of 3 QoS classes:

  1. Guaranteed – assigned when every container in the pod has both a request and a limit set for memory and CPU, and these values match
  2. Burstable – at least one container in the pod has a request and/or limit set, but the conditions for Guaranteed are not met (for example, request < limit)
  3. BestEffort – when no container in the pod has any resource restrictions set
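
For illustration, a minimal sketch of a pod that would receive the Guaranteed class – requests and limits are set for both resources in its only container and are equal (the name, image and values are purely illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: qos-guaranteed-example
    spec:
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 512Mi

The class actually assigned to a pod can be seen in its status (the status.qosClass field) after it is created.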

At the same time, when a node runs short of resources (disk, memory), kubelet starts ranking and evicting pods according to a specific algorithm that takes into account the pod's priority and its QoS class. For example, if we are talking about RAM, then based on the QoS class, points are awarded according to the following principle:

  • Guaranteed: -998
  • BestEffort: 1000
  • Burstable: min (max (2, 1000 – (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

That is, with the same priority, kubelet will first evict pods with the BestEffort QoS class from the node.
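
For example (the numbers are purely illustrative): a Burstable pod that requests 1 GiB of memory on a node with 4 GiB of memory gets min(max(2, 1000 – (1000 * 1 GiB) / 4 GiB), 999) = min(max(2, 750), 999) = 750 points – far more than a Guaranteed pod (-998) but fewer than a BestEffort pod (1000), so it will be evicted after BestEffort pods and before Guaranteed ones.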

Conclusion: if you want to reduce the likelihood of the pod you need being evicted from a node when it runs short of resources, then along with its priority you also need to take care of setting its request / limit.
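
Since the pod's priority is also taken into account here, it is worth showing how it is usually set – via a PriorityClass object referenced from the pod spec. A minimal sketch (the class name, value and description are illustrative):

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: nxs-high-priority
    value: 1000000
    globalDefault: false
    description: "Pods that should be evicted last"

In the pod manifest such a class is then referenced via the spec.priorityClassName field.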

The mechanism of horizontal auto-scaling of application pods (HPA)

When the task is to automatically increase and decrease the number of pods depending on resource usage (system metrics – CPU / RAM, or user metrics – rps), a k8s entity such as the HPA (Horizontal Pod Autoscaler) can help solve it. Its algorithm is as follows:

  1. The current value of the observed metric is determined (currentMetricValue)
  2. The desired value of the metric (desiredMetricValue) is determined; for system resources it is set via request
  3. The current number of replicas is determined (currentReplicas)
  4. The desired number of replicas (desiredReplicas) is calculated with the following formula, where the square brackets denote rounding up:
    desiredReplicas = [ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

However, scaling will not occur while the ratio (currentMetricValue / desiredMetricValue) is close to 1 (the allowed tolerance can be configured via the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance; by default it is 0.1).

Let's look at how hpa works using the app-test application (described as a Deployment) as an example, where the number of replicas must change depending on CPU consumption:

  • Application manifest

    kind: Deployment
    apiVersion: apps/v1beta2
    metadata:
     name: app-test
    spec:
     selector:
       matchLabels:
         app: app-test
     replicas: 2
     template:
       metadata:
         labels:
           app: app-test
       spec:
         containers:
         - name: nginx
           image: registry.nixys.ru/generic-images/nginx
           imagePullPolicy: Always
           resources:
             requests:
               cpu: 60m
           ports:
           - name: http
             containerPort: 80
          - name: nginx-exporter
            image: nginx/nginx-prometheus-exporter
            resources:
              requests:
                cpu: 30m
            ports:
            - name: nginx-exporter
              containerPort: 9113
            args:
            - -nginx.scrape-uri
            - http://127.0.0.1:80/nginx-status

    That is, the application pod is initially launched in two instances, each of which contains two containers – nginx and nginx-exporter – and a CPU request is set for each of them.

  • HPA manifest

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
     name: app-test-hpa
    spec:
     maxReplicas: 10
     minReplicas: 2
     scaleTargetRef:
       apiVersion: extensions/v1beta1
       kind: Deployment
       name: app-test
     metrics:
     - type: Resource
       resource:
         name: cpu
         target:
           type: Utilization
           averageUtilization: 30

    That is, we created an hpa that will monitor the app-test Deployment and adjust the number of application pods based on the cpu metric (we expect each pod to consume 30% of the CPU it requests), with the number of replicas kept in the range of 2 to 10.

    Now let's look at how hpa works if we apply load to one of the pods:

     # kubectl top pod
     NAME                                                   CPU(cores)   MEMORY(bytes)
     app-test-78559f8f44-pgs58            101m         243Mi
     app-test-78559f8f44-cj4jz            4m           240Mi

In total we have the following:

  • Desired value (desiredMetricValue) – according to the hpa settings this is 30%
  • Current value (currentMetricValue) – to obtain it, the controller-manager calculates the average resource consumption in %, roughly doing the following:
    1. Gets the absolute values of the pod metrics from the metrics server, i.e. 101m and 4m
    2. Calculates the average absolute value, i.e. (101m + 4m) / 2 = 53m
    3. Gets the absolute value of the desired resource consumption (for this, the requests of all containers are summed up): 60m + 30m = 90m
    4. Calculates the average percentage of CPU consumption relative to the pod's request, i.e. 53m / 90m * 100% = 59%

Now we have everything we need to determine whether the number of replicas should be changed; to do this we calculate the ratio:

ratio = 59% / 30% = 1.96

That is, the number of replicas should be increased roughly 2 times, to [2 * 1.96] = 4.

Conclusion: as you can see, for this mechanism to work, one of the prerequisites is that requests are set for all containers in the observed pod.

The mechanism of horizontal auto-scaling of nodes (Cluster Autoscaler)

To neutralize the negative impact on the system during load bursts, having a configured hpa is not enough. For example, based on the settings, the hpa controller-manager may decide that the number of replicas needs to be doubled, but the nodes have no free resources to run that many pods (i.e. a node cannot provide the resources requested in the pods' requests), and these pods go into the Pending state.

In this case, if the provider has a corresponding IaaS / PaaS (for example, GKE / GCE, AKS, EKS, etc.), a tool such as the Cluster Autoscaler can help. It allows you to set the maximum and minimum number of nodes in the cluster and automatically adjust the current number of nodes (by calling the cloud provider's API to order / remove nodes) when there is a shortage of resources in the cluster and pods cannot be scheduled (i.e. are stuck in the Pending state).
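
As an illustration, below is a fragment of a possible cluster-autoscaler configuration – a sketch assuming an AWS-based cluster; the image tag, node-group name and the exact set of flags are illustrative and should be checked against your provider's documentation:

    spec:
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/cluster-autoscaler:v1.16.1
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=2:10:nxs-k8s-workers
        - --scan-interval=30s

Here the --nodes flag sets the minimum and maximum size of the given node group, within which the autoscaler is allowed to resize the cluster.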

Conclusion: to enable node auto-scaling, you need to specify requests in the pod containers so that k8s can correctly evaluate the load of the nodes and report that there are no resources left in the cluster to launch the next pod.


Conclusion

It should be noted that setting resource restrictions for containers is not a prerequisite for an application to launch successfully, but it is still better to do so for the following reasons:

  1. For more accurate scheduler operation in terms of load balancing between k8s nodes
  2. To reduce the likelihood of a pod eviction
  3. For horizontal auto-scaling of application pods (HPA) to work
  4. For horizontal auto-scaling of nodes (Cluster Autoscaler) with cloud providers
