Important Aspects. Part 1
Recently, on one of the YouTube channels, I discussed the inner workings of the Kubernetes Scheduler in detail. While preparing the material, I came across many new and interesting facts that I would like to share with you. In this article, we will look at what exactly happens “under the hood” of the Kubernetes Scheduler and which aspects are important for understanding how it works.
I plan to go from simple to complex, so please bear with me. If you’re already familiar with the basic concepts, feel free to skip the introductory part and jump straight to the key details.
If you asked an ordinary developer how they would implement the k8s scheduler, the answer would most likely be something like this:
# a naive scheduling loop
while True:
    pods = get_all_pods()       # fetch every pod in the cluster
    for pod in pods:
        if pod.node is None:    # the pod has no node assigned yet
            assignNode(pod)     # pick a node and bind the pod to it
But this article would not exist if everything were that simple.
What is Kubernetes Scheduler and what problems does it solve?

The Scheduler in Kubernetes is responsible for distributing Pods across worker nodes (Nodes) in the cluster. The main job of the scheduler is to optimize the placement of pods, taking into account the available resources on the nodes, the requirements of each pod, and various other factors.
If you asked me to describe the Kubernetes Scheduler functionality in a nutshell, I would highlight two key tasks:
Selecting a suitable node for the pod. During this process, the scheduler analyzes the cluster to make sure the pod can run efficiently on the chosen node.
Binding the pod to the selected node.
Where is Kubernetes Scheduler located in the Kubernetes architecture?
If you try to depict the sequence of actions that occur when creating a pod, you will get the following diagram:

The image below shows this sequence of steps.

A Pod is created by a controller responsible for the Deployment and ReplicaSet state, or directly through the API (for example, via kubectl apply).
The scheduler picks up the new pod and selects a node for it.
The kubelet (which is not part of the scheduler) on the worker node creates and runs the pod’s containers.
The kubelet cleans up the pod’s data after the pod is deleted.
How does Kubernetes Scheduler work in basic terms?
Information collection: The scheduler constantly monitors the state of the cluster, collecting data about available nodes, their resources (CPU, memory, etc.), the current placement of pods, and their requirements.
Candidate Determination: As soon as a new pod appears that requires placement, the scheduler initiates the process of selecting a suitable worker node. The first step is to create a list of all available nodes that meet the basic requirements of the pod, such as processor architecture, amount of available memory, and so on.
Filtering: Nodes that do not meet the additional requirements and restrictions specified in the pod specification are removed from the generated list. These can be, for example, affinity/anti-affinity rules, taints and tolerations (an example pod spec with such constraints is shown after these steps).
Ranking: After the filtering step, the scheduler ranks the remaining nodes to select the most suitable one for pod placement.
Assigning a Pod to a Node: At this stage, the Pod is assigned to the selected worker node and a corresponding entry is added to etcd. The kubelet then detects the new job and initiates the process of creating and running containers.
Status Update: Once a Pod has been successfully placed, its location information is updated in etcd (the single source of truth) so that other system components can access it.
These steps correspond to “extension points” (aka plugins), which allow you to extend the scheduler’s functionality. They are implemented by the Scheduler Framework, which we will talk about in Part 2 of the article.
For example, you can add new filters or ranking algorithms to meet the specific requirements of your application. In fact, there are many more plugins; we will return to this in the next part.
The process repeats: The scheduler continues to monitor the cluster for the next pod that needs placement.
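As an illustration, here is a hypothetical pod manifest that gives the filtering step something to work with: a nodeSelector and a toleration (the pod name, taint key, and values are made up for this example):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.14.2
  nodeSelector:
    kubernetes.io/arch: amd64  # only nodes with this label pass filtering
  tolerations:
  - key: "dedicated"           # hypothetical taint key
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"       # lets the pod land on nodes tainted dedicated=batch:NoSchedule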
A simplified version of the scheduling process is shown in the figure below. We’ll look at it in detail in Part 2, but for now let’s look at how the scheduler works in a basic sense.

The diagram is deliberately simplified so as not to clutter it.
In the k8s documentation this process has the same structure but is depicted in a more general form: instead of an informer, an event handler is shown. Informers use event handlers to trigger specific actions when a change is detected in the cluster. For example, if a new pod is created that needs to be scheduled, the informer’s event handler will activate the scheduling algorithm for that specific pod.
Informer: The Kubernetes scheduler actively uses a mechanism called “Informer” to monitor the state of the cluster. An informer is a set of controllers that continuously watch certain resources (stored in etcd and accessed through the API server). When changes are detected, the information is updated in the scheduler’s internal cache. This cache helps optimize resource consumption and provides up-to-date data about nodes, pods, and other elements of the cluster.
Schedule Pipeline: The scheduling process in Kubernetes begins with adding new pods to the queue. This is done by the Informer component. Pods are then taken from this queue and go through the so-called “Schedule Pipeline” – a chain of steps and checks, after which the pod is finally placed on a suitable node.
The Schedule Pipeline is divided into three threads.
Main thread: As you can see in the picture, the main thread performs the filtering, ranking, and pre-reservation steps.
Filter – unsuitable nodes are screened out here.
Score – the remaining nodes are ranked here, i.e. the most suitable node for the pod is chosen from all the remaining ones.
Reserve – resources on the node are preliminarily reserved for the pod, so that other pods cannot occupy them (preventing a race condition). This plugin also implements the UnReserve method.
UnReserve is a method of the Reserve plugin. It releases resources on a node that were previously reserved for the pod. It is called if the pod has not been bound to a node within a certain time (timeout) or if the Permit plugin has assigned the deny status to the pod. This allows other pods to take those resources.
Permit thread: This phase is used to prevent the pod from getting stuck in a limbo (unknown) state. A Permit plugin can do one of three things:
approve – all previous plugins have confirmed that the pod can run on the node, so the final decision for the pod is approve.
deny – one of the previous plugins did not return a positive result, so the final decision for the pod is deny.
wait – if a Permit plugin returns “wait”, the pod remains in the permit phase until it receives an approve or deny status. If a timeout occurs, “wait” becomes “deny”, the pod returns to the scheduling queue, and the UnReserve method of the Reserve phase is triggered.
Bind thread: This part is responsible for recording that the pod has been bound to the node.
Pre-bind – steps that must be completed before binding the pod to the node, for example provisioning network storage and attaching it to the node.
Bind – the pod is actually bound to the node here.
Post-bind – the very last step, performed after the pod has been bound to the node. It can be used for cleanup or additional actions.
The Schedule Pipeline also uses a cache to store data about pods.
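For reference, these extension points appear by name in the scheduler configuration. The sketch below is purely illustrative: the plugins shown (NodeResourcesFit, VolumeBinding, DefaultBinder) are in-tree plugins that are already enabled by default, and the snippet only shows where each extension point lives in a KubeSchedulerConfiguration:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    filter:              # Main thread: screen out unsuitable nodes
      enabled:
      - name: NodeResourcesFit
    reserve:             # Main thread: reserve resources (UnReserve on failure)
      enabled:
      - name: VolumeBinding
    preBind:             # Bind thread: steps before binding
      enabled:
      - name: VolumeBinding
    bind:                # Bind thread: the binding itself
      enabled:
      - name: DefaultBinder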
Important aspect:
In the Main and Permit threads, pods are scheduled strictly sequentially, one at a time. This means the scheduler cannot process several pods simultaneously in the Main thread and the Permit thread.
The exception is the UnReserve method of the Reserve plugin: it can be called from the Main thread, the Permit thread, or the Bind thread.
This restriction was introduced to avoid a situation where several pods try to occupy the same resources on a node.
All other threads can execute asynchronously.
Let’s move on to practice and see all of this for ourselves
1. Let’s create a new pod
To give the scheduler a job, let’s create a new pod using the kubectl apply command.
We’ll create the pod via a Deployment.
It is important to note that the scheduler only works with pods; the state of the Deployment and ReplicaSet is managed by controllers.
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

2. The controller creates pods
In fact, we are creating a Deployment, which in turn creates a ReplicaSet, which in turn creates the Pods.
The controller responsible for the Deployment and ReplicaSet state sees the corresponding new objects and begins its work.

The controller will see the Deployment above, and a ReplicaSet object similar to this one will be created:
ReplicaSet
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: ReplicaSet
  metadata:
    annotations:
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "1"
    labels:
      app: nginx
    name: nginx-deployment-85996f8dbd
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: nginx-deployment
      uid: b8a1b12e-94fc-4472-a14d-7b3e2681e119
    resourceVersion: "127556139"
    uid: 8140214d-204d-47c4-9538-aff317507dd2
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: nginx
        pod-template-hash: 85996f8dbd
    template:
      metadata:
        labels:
          app: nginx
          pod-template-hash: 85996f8dbd
      spec:
        containers:
        - image: nginx:1.14.2
          imagePullPolicy: IfNotPresent
          name: nginx
          ports:
          - containerPort: 80
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    availableReplicas: 3
    fullyLabeledReplicas: 3
    observedGeneration: 1
    readyReplicas: 3
    replicas: 3
kind: List
As a result of the work of the controller responsible for the ReplicaSet, 3 pods will be created. They receive the Pending status because the scheduler has not yet assigned them to nodes; these pods are added to the scheduler queue.
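At this point the pods can be seen in the Pending state. The output below is illustrative; the generated name suffixes will differ in your cluster:

kubectl get pods

NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-85996f8dbd-aaaaa   0/1     Pending   0          2s
nginx-deployment-85996f8dbd-bbbbb   0/1     Pending   0          2s
nginx-deployment-85996f8dbd-ccccc   0/1     Pending   0          2s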
3. The scheduler comes into play
This is how it can be shown in our diagram.

Each pod is taken from the scheduler queue in turn and:
Goes through the Schedule Pipeline, where the most suitable node is selected
Is bound to the selected node

Scheduling phase
I may repeat myself a little, but here is a bit more detail about the pipeline itself.

Filter – weed out unsuitable nodes.
For example, if we want to place a pod on a node that has a GPU, nodes without a GPU are of no interest to us.
Next, we remove nodes that do not have enough resources to run the pod. For example, if a pod requires 2 CPUs but a node has only 1 CPU, such a node is not suitable (see the sketch below).
And so on – there can be quite a lot of filtering steps; we will return to this in Part 2.
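The sketch mentioned above: a hypothetical container section that would trigger this kind of filtering, since nodes unable to provide the requested 2 CPUs are discarded:

containers:
- name: app                # hypothetical container
  image: nginx:1.14.2
  resources:
    requests:
      cpu: "2"             # nodes without 2 available CPUs will not pass the Filter step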
Score – rank the remaining nodes.
If more than one node remains, we need to somehow select the most suitable one rather than just picking one at random.
This is where various plugins come into play. For example, the ImageLocality plugin favors a node that already has the container image we want to run, which saves the time needed to pull the image from the container registry.
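If image locality matters a lot in your cluster, the weight of this plugin can be raised in the scheduler configuration. A minimal sketch, assuming the default scheduler profile (the weight value is arbitrary):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: ImageLocality
        weight: 5          # a higher weight gives image locality more influence on ranking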

Reserve – we reserve resources on the node for the pod.
To prevent the resources of our chosen node from being taken by another pod before binding completes, we reserve this node.

Un-Reserve – if something goes wrong at any of the stages, we call this method to free up resources on the node and send the pod back to the scheduler queue.
Permit – we check that the pod can be launched on the node.
If all the previous steps were successful, a final check is made that the pod can be launched on the node. For example, if we have an affinity rule that says the pod should run on a node with a certain label, we check that the selected node matches this rule. If everything is fine, the approve status is returned; otherwise, deny.
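Such an affinity rule might look like this in the pod spec (the disktype label and its value are hypothetical):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype        # hypothetical node label
          operator: In
          values:
          - ssd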
Binding phase
In this phase, we perform additional steps before the final binding, the binding of the pod to the node itself, and the necessary steps after binding. Read more about this phase in Part 2.
It’s important to note that this thread works asynchronously.

Kubelet – launching a container on a node

Once we have assigned a node to the pod, the kubelet sees this change and begins launching the pod’s containers on the node. Once again, note that the kubelet is not part of the scheduler.
The pod is running on the most suitable node, and we can see this in the output of the kubectl get pods command. This means the scheduler has done its job.
kubectl get pods -o wide
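Illustrative output (the pod name suffix, IP address, and node name are placeholders):

NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
nginx-deployment-85996f8dbd-aaaaa   1/1     Running   0          2m    10.244.1.12   node-1   <none>           <none>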
This is what the Schedule Pipeline looks like in a simplified form, and we will look at it in detail in Part 2.
In the next part, we’ll dig deeper and learn more about the inner workings of the scheduler.
In particular, we will:
Look at the Scheduler Framework
Find out how to extend the scheduler’s functionality
Pull back the curtain on the scheduler queue
Look at examples of plugins
There will be a link to part 2 here.
By the way, you can learn about the Kubernetes Scheduler and more in online courses from OTUS, where you can view the course catalog and register for free webinars on topics that interest you.