CPU schedulers in Linux

This article covers the basic algorithms the Linux kernel uses to schedule tasks that are ready to run: how a task's priority and the policy assigned to it determine when it receives processor time and how much.

I would like to note right away that in addition to the algorithms and scheduling classes described below, Linux also implements additional mechanisms based on system power consumption, CPU capacity, and so on. They are built into the classes we will talk about below, but this article will not touch on them.

Concepts of task, thread and process

Let's start with the basics. Task, thread, and process are, in principle, quite close in meaning, but it is still worth figuring out what each of them means.

When you run an application, the system creates a process, which is allocated memory to store executable code and other data. Within this process, at least one thread is created. It is among threads that Linux schedulers distribute processor time and decide in what order they will run. A task can be thought of as the instructions themselves that the processor receives and executes. Later in the article I will occasionally call threads tasks or processes, because in this context they are more or less synonyms.

It is also important to know that every process has a state field which, as the name suggests, shows what state the process is currently in. There are several possible states:

  • Running or Runnable (R) – a process in the running state is using CPU time right now, while a runnable process is just waiting for it.

  • Uninterruptible Sleep (D) – the process waits for the result of an I/O operation. For example, it waits for data from the disk. There is no way to kill such a process.

  • Interruptible Sleep (S) – the process waits for some event and does not consume CPU time. Example: command line input.

  • Stopped (T) – the process received a SIGSTOP or SIGTSTP signal and stopped its work. In this state the process can be sent a SIGCONT signal to start it running again.

  • Zombie (Z) – a process that received a termination signal and finished completely; it no longer consumes any system resources, only a PID. It waits for its parent process to make the wait system call, after which the zombie disappears. Sometimes that never happens, and then you have to kill the parent.

All information about a process can be obtained from the directory /proc/<PID>, where <PID> is the process identifier. If you want the information in a more user-friendly form, use utilities such as top, htop, or ps.
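For example, here is a small C program that reads its own state from /proc – the same information that top, htop, and ps present in a nicer form:

#include <stdio.h>
#include <string.h>

int main(void) {
    // /proc/self is a shortcut for /proc/<PID> of the calling process.
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "State:", 6) == 0) {
            fputs(line, stdout);   // e.g. "State:  R (running)"
            break;
        }
    }
    fclose(f);
    return 0;
}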

Policies and classes

Each task has a policy that determines exactly how the execution of this task is to be scheduled and what algorithms to use for it. Classes, in turn, are essentially implementations of these policies. Not every policy has its own personal class, but each is implemented within some class. For example, rt_sched_class implements two policies, SCHED_FIFO and SCHED_RR, while dl_sched_class implements only SCHED_DEADLINE.

The policy of a task is determined by its priority, sched_priority, which can be found in the file /proc/<PID>/sched. Priorities are mapped to classes as follows (a small user-space check is shown after the list):

  • sched_priority == 0 – class “Deadline”

  • sched_priority >= 1 && sched_priority <= 99 – class “Real-Time”

  • sched_priority >= 100 – class “Fair”
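A quick way to see from user space which policy (and therefore which class) a task runs under is the standard sched_getscheduler(2) call. A minimal sketch for the calling process (the SCHED_DEADLINE constant is guarded because older C libraries may not expose it):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6   // value from <linux/sched.h>
#endif

int main(void) {
    // pid 0 means "the calling process".
    int policy = sched_getscheduler(0);
    switch (policy) {
    case SCHED_DEADLINE:
        puts("Deadline class");
        break;
    case SCHED_FIFO:
    case SCHED_RR:
        puts("Real-Time class");
        break;
    case SCHED_OTHER:
    case SCHED_BATCH:
    case SCHED_IDLE:
        puts("Fair class");
        break;
    default:
        perror("sched_getscheduler");
    }
    return 0;
}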

Class for “ordinary” tasks

Distribution of processor time among tasks with the SCHED_OTHER, SCHED_BATCH, and SCHED_IDLE policies is handled by fair_sched_class, described in the file kernel/sched/fair.c. From 2007 to 2023 (kernel versions 2.6.23 through 6.6) it used the “Completely Fair Scheduler” (CFS) algorithm; now it uses “Earliest Eligible Virtual Deadline First” (EEVDF).

For “regular” tasks the priority field is ignored; instead, the NICE value plays a role in scheduling.

NICE is an indicator of a thread's tolerance toward other threads :). The higher this value, the more willingly the thread shares processor time or its place in the queue (depending on the scheduler; more on this below). The minimum value is -20, the maximum is +19.
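From user space, NICE can be changed with the nice and renice utilities or, programmatically, with setpriority(2). A minimal example that makes the calling process “nicer”:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    // Raise this process's nice value to +10, telling the scheduler
    // it is happy to yield CPU time to others.
    if (setpriority(PRIO_PROCESS, 0, 10) != 0) {
        perror("setpriority");
        return 1;
    }
    printf("nice is now %d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}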

About the policies:

“Completely Fair Scheduler” CFS

As the name suggests, the main goal of the CFS scheduler is to distribute CPU time as fairly as possible among all processes that need it. As the Linux documentation tells us, CFS is a simulation of an “ideal multitasking CPU” which, if it existed, would allocate CPU time something like this: with two runnable processes, each gets 50% of the CPU time.

In CFS, all runnable processes are stored in a red-black tree sorted by the process's vruntime, which is the amount of CPU time in nanoseconds that the process has used. Thus the processes that have used the least time end up on the left side of the tree, and those that have used the most on the right. When CFS chooses which task to give CPU time to, it picks the leftmost task in the tree, that is, the one that has received the least.

Example of a Red-Black tree

Since any operation on a red-black tree has a complexity of O(log n), the CFS algorithm itself has the same complexity. However, CFS uses a slightly modified implementation of this tree that also stores a pointer to the leftmost node, so finding the process with the smallest vruntime is O(1).
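To make this concrete, here is a heavily simplified sketch (the names are illustrative, not the kernel's actual code) of why the pick is O(1) once the leftmost node is cached:

// Illustrative model: the runqueue keeps a red-black tree ordered by
// vruntime plus a cached pointer to its leftmost (smallest-vruntime) node.
struct cfs_task {
    unsigned long long vruntime;            // ns of CPU time used
    struct cfs_task *left, *right, *parent; // red-black tree links
};

struct cfs_runqueue {
    struct cfs_task *root;      // ordered by vruntime, O(log n) updates
    struct cfs_task *leftmost;  // cached: the task that has run the least
};

// Picking the next task is O(1): it is always the cached leftmost node.
static struct cfs_task *pick_next_task(struct cfs_runqueue *rq) {
    return rq->leftmost;
}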

When a task gains access to CPU time, it gets it for a predetermined interval called a slice. The slice can grow or shrink depending on the process's NICE value: for example, the smaller the NICE value, the more CPU time a task will receive compared to tasks with a higher value.
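As an illustration, here is a rough model of that weighting, based on the kernel's sched_prio_to_weight table (nice 0 corresponds to weight 1024, and each nice step changes the weight by roughly 1.25x); the period value is an assumption for the example:

#include <math.h>
#include <stdio.h>

// Approximation of the kernel's nice-to-weight table. Build with -lm.
static double weight_of(int nice) {
    return 1024.0 / pow(1.25, nice);
}

int main(void) {
    double period_ms = 24.0;  // assumed scheduling period, for illustration
    double w0 = weight_of(0), w5 = weight_of(5);
    // Two runnable tasks split the period in proportion to their weights.
    printf("nice 0 -> %.1f ms, nice 5 -> %.1f ms\n",
           period_ms * w0 / (w0 + w5),
           period_ms * w5 / (w0 + w5));
    return 0;
}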

Even though CFS did a really good job of distributing CPU time “fairly” among all runnable processes, it also had serious shortcomings. One of them was its inability to properly handle processes that need CPU time as quickly as possible. To teach CFS to cope with this, the so-called latency nice patches were created. They added a new value, latency-nice, which mirrored the existing nice value: it was likewise bounded between -20 and +19, and lowering it likewise decreased tolerance toward other processes. If a new process's latency-nice was lower than that of the one currently running on the processor, the new process could preempt it. Thus, tasks with a low latency-nice did not receive more processor time, but were entitled to get it faster.

“Earliest Eligible Virtual Deadline First” EEVDF

Peter Zijlstra reckoned that there is a better way to solve the problem the latency-nice patches solved for CFS. He suggested that the EEVDF algorithm, introduced back in 1995, might be better suited for this and would save the scheduler from a bunch of “icky heuristics code” (in plainer words, “shit code”).

From the name of this algorithm you can roughly guess how it works :). It assigns to the processor the task that is “eligible” and has the earliest deadline. In principle, that's all. It only remains to figure out what “eligible” actually means, and how this mysterious deadline is calculated.

Let's start with the “eligible” parameter. Eligibility is determined by the vruntime_eligible function in the file kernel/sched/fair.c, whose comment reads: “Entity is eligible once it received less service than it ought to have, eg. lag >= 0”. That is, a task is “eligible” if and only if it has received less CPU time than it should have, which is exactly what the inequality lag >= 0 expresses: lag is the difference between the time the task should have received and the time it has actually received. How is it calculated? We take the average vruntime of the whole queue and subtract the task's vruntime from it. It is important to understand that both the queue's average vruntime and the individual task's vruntime are computed taking into account task weights, which are derived from their NICE values. Thus, thanks to the eligibility mechanism, time is distributed “fairly” among all processes.
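A simplified model of that check in C (the real kernel avoids the division below and maintains the weighted average incrementally, but the idea is the same):

struct entity {
    long long vruntime;  // weighted CPU time received, in ns
    long long weight;    // derived from the task's NICE value
};

// Weighted average vruntime of the whole queue.
static long long avg_vruntime(const struct entity *q, int n) {
    long long sum = 0, total_weight = 0;
    for (int i = 0; i < n; i++) {
        sum += q[i].vruntime * q[i].weight;
        total_weight += q[i].weight;
    }
    return sum / total_weight;
}

// "Entity is eligible once it received less service than it ought
// to have": lag = avg_vruntime - vruntime >= 0.
static int eligible(const struct entity *task,
                    const struct entity *q, int n) {
    return avg_vruntime(q, n) - task->vruntime >= 0;
}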

Now let's move on to the concept of a “deadline”. The deadline is the earliest time by which a task must receive its slice of processor time. It is calculated by the formula vruntime + (slice / weight); accordingly, the greater a task's weight (which depends on NICE), the earlier its deadline.
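In code form the formula is just one line (a sketch; the kernel operates on scaled fixed-point weights):

// Heavier (lower-NICE) tasks divide their slice by a larger weight,
// so their virtual deadline comes earlier.
static long long virtual_deadline(long long vruntime,
                                  long long slice, long long weight) {
    return vruntime + slice / weight;
}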

All runnable processes, just as in CFS, are stored in a red-black tree, but it is sorted not by vruntime but by deadline. The tree itself is “augmented”: each node contains not only the process's deadline but also the rest of its data (as in CFS), including min_vruntime, which is calculated by the formula se->min_vruntime = min(se->vruntime, se->{left,right}->min_vruntime). This makes it possible to filter the tree by eligibility and then select the leftmost node, that is, the eligible process with the earliest deadline.
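A naive O(n) sketch of the whole pick, reusing the idea above (the real scheduler gets the same answer in O(log n) by skipping subtrees whose min_vruntime makes them ineligible):

// 'avg' is the weighted average vruntime of the queue
// (see the eligibility sketch above).
struct eevdf_entity {
    long long vruntime;   // service received so far
    long long deadline;   // vruntime + slice / weight
};

static struct eevdf_entity *pick_eevdf(struct eevdf_entity *q, int n,
                                       long long avg) {
    struct eevdf_entity *best = 0;
    for (int i = 0; i < n; i++) {
        if (q[i].vruntime > avg)          // lag < 0: not eligible
            continue;
        if (!best || q[i].deadline < best->deadline)
            best = &q[i];                 // earliest eligible deadline
    }
    return best;
}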

Class for Real-Time tasks

rt_sched_class, described in the file kernel/sched/rt.c, distributes processor time among tasks with the SCHED_FIFO and SCHED_RR policies; both are described in the POSIX standard. The NICE value plays no role for these tasks, only the priority is taken into account.

SCHED_FIFO (FIFO – First In First Out) takes the task with the highest priority and runs it on the processor until it blocks, finishes its execution, or is preempted by a higher-priority task.

For SCHED_RR (RR – Round-Robin) everything is the same as for SCHED_FIFO, except that it does not let tasks run on the processor “to completion”: their execution time is limited. A task that has used up its allotted time is sent to the back of the queue of tasks with the same priority.
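The length of that Round-Robin timeslice can be queried with the POSIX sched_rr_get_interval(2) call:

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    // Ask the kernel for the SCHED_RR timeslice of the calling
    // process (pid 0 = self).
    if (sched_rr_get_interval(0, &ts) != 0) {
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("RR timeslice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}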

It is important to note that since Real-Time tasks have a higher priority than regular ones, any newly runnable Real-Time task preempts a running regular task (if there is one, of course).

To give a task a “Real-Time” policy, the user can use the chrt utility.

chrt -f 10 -p 2445 # Set the SCHED_FIFO policy and priority 10
                   # for the task with PID 2445

chrt -r 10 your_command # Run your command under the SCHED_RR policy
                        # with priority 10
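The same can be done programmatically with sched_setscheduler(2); here is a minimal equivalent of the first chrt call, applied to the calling process:

#include <sched.h>
#include <stdio.h>

int main(void) {
    // Equivalent of `chrt -f 10 -p <pid>` for the calling process
    // (pid 0): SCHED_FIFO with priority 10. Requires root/CAP_SYS_NICE.
    struct sched_param param = { .sched_priority = 10 };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    return 0;
}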

Class for Deadline tasks

dl_sched_class, described in the file kernel/sched/deadline.c, allocates CPU time for tasks with the SCHED_DEADLINE policy, combining two algorithms: EDF (“Earliest Deadline First”) and CBS (“Constant Bandwidth Server”). In this class both the priority and NICE values of tasks are ignored, and thanks to CBS all tasks run “isolated” from each other, that is, they do not interfere with one another.

Deadline tasks have the highest priority among all other tasks in user space; thus, they can preempt even Real-Time tasks with the highest priority.

Within this class a task has three parameters: runtime determines how much CPU time the task will receive, period determines how often it will receive it, and deadline, maintained using CBS, sets the cutoff timestamp by which the task must run so that it does not interfere with other tasks. The EDF algorithm, in turn, uses the deadline value to decide which task to assign to the processor, always choosing the process with the earliest deadline.

An idealized (not real) example of how the algorithm works, from the article https://lwn.net/Articles/743740/

To set the “deadline” policy for a process the user, as with the “Real-Time” policies, uses the chrt utility, but now, in addition to the priority (which is always 0 for “deadline” tasks), the period, runtime, and deadline must also be specified, in nanoseconds.

# In this example the task gets a guaranteed 5 ms of CPU time
# per 16.6 ms period, with a 10 ms deadline
chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
        --sched-period 16666666 0 your_command
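Programmatically this is done with the sched_setattr(2) system call. glibc provides no wrapper for it, so the sketch below (modeled on the man page example) declares the attribute structure and invokes the raw syscall, with the same 5 ms / 10 ms / 16.6 ms parameters as the chrt command above:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6           // value from <linux/sched.h>
#endif

// Layout per sched_setattr(2); glibc does not declare this struct.
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

int main(void) {
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  =  5000000,   // 5 ms of guaranteed CPU time...
        .sched_deadline = 10000000,   // ...delivered within 10 ms...
        .sched_period   = 16666666,   // ...in every 16.6 ms period
    };
    // pid 0 = calling thread; requires root (CAP_SYS_NICE).
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }
    return 0;
}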
