Radical asynchrony

Asynchrony… So much packed into this word! Asynchrony is widely considered one of the most difficult topics in programming, and not without reason. Articles and tutorials on various aspects of asynchrony appear on the Internet regularly; among the recent ones, I found this material useful. In fact, that article was the last straw for my patience. I made an attempt to look at asynchrony, and the problems and difficulties associated with it, from a completely different angle. The conclusions, I would say, are quite radical. So, let's go!

The source of our headaches is I/O. I/O operations are, as a rule, slow: they proceed at speeds far below those of our programs. Our fast processors are forced to sit idle waiting for slow I/O operations to complete, instead of performing useful calculations. As a result, the performance of our programs is much lower than it could be.

The drawback of the synchronous approach

In the synchronous approach to I/O programming, we always wait for the current operation to complete. Yes, a modern processor sits idle, but we get simple and clear program code. We use simple, intuitive input and output functions such as read() or write(). Execution of the program (thread) is blocked until read() has read data from the network socket; control returns to us only when the data is already read. After returning from the function, we can begin processing the received information.

Let's look at a classic example usually considered in such cases. We are writing a network service that receives requests from clients, processes the request, say, looks up information in a database, and sends a response to the client. In synchronous style, such code could be like this (pseudocode):

auto request = socket.read(...);                             // blocks until the request arrives
auto result = db_client.do_query(prepare_request(request)); // blocks on the database round-trip
socket.write(prepare_result(result));                        // blocks until the response is sent
socket.close();

This is very good code: easy to read and easy to follow! What more could you want?

Usually at this point people write about performance, and rightly so. What happens when there are tens, hundreds, thousands of clients? With this code, clients have to wait their “turn” while the requests of previous clients are processed. Of course, nobody writes it that way anymore: typically, a separate thread is created for each new client (incoming connection), and each client is served in its own thread. That thread may block on some I/O operation, and at first glance there is nothing wrong with this. But what happens when there are really a lot of clients, thousands and tens of thousands? The same number of threads will be created.
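
In pseudocode, the thread-per-client model looks something like this (Listener, Socket, and handle_client are hypothetical, continuing the example above):

#include <thread>

// One thread per client: each connection gets its own blocking handler.
void handle_client(Socket socket) {
    auto request = socket.read(...);                         // may block for a long time
    auto result = db_client.do_query(prepare_request(request));
    socket.write(prepare_result(result));
    socket.close();
}

void serve(Listener& listener) {
    for (;;) {
        Socket socket = listener.accept();                   // wait for the next client
        std::thread(handle_client, std::move(socket)).detach(); // thousands of clients -> thousands of threads
    }
}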

Authors of articles about “asynchrony” usually write about performance problems at this point too. Our processors have a limited number of cores, they say, and will choke on context switching between thousands and tens of thousands of threads. A context switch is a very expensive operation: it usually requires a transition into kernel mode, invokes the operating system's thread scheduler, and so on. All of this is correct.

The usual framing of the disadvantage is that we have thousands of threads, most of which sit idle, waiting for their current I/O operation to complete. Personally, I see no problem in that by itself. So what if we have thousands of threads, 90% of which are idle? The remaining 10% are more than enough to load the processor and all its cores with useful work. So what is the problem?

The problem is precisely the expensive context switching operation when switching threads. Too much CPU time is spent not on useful work performed inside the thread, but on switching between threads.

The problem can also be caused by the peculiarities of the operating system's thread scheduler and of preemptive multitasking. For example, suppose a thread is granted a minimum time slice, calculated dynamically in the range from 1 to 10 milliseconds, as in Linux, but after 30 microseconds of active work the thread initiates an I/O operation in blocking synchronous mode. If the I/O operation takes longer than 10 milliseconds, the remainder of that time slice is wasted. And here asynchrony enters the scene.

Asynchrony

In asynchronous mode, we do not wait inside read() and write() for the I/O to complete; instead, these functions return control almost instantly. We can perform useful computational work while our I/O is in progress; some time later we return to the pending I/O and process its results. We need only one thread, or a handful of them, to handle thousands of incoming connections! We are always doing useful work, improving overall performance without wasting precious CPU cycles on switching between threads.

But the main problem is the readability and comprehensibility of the code. We read from the socket in one place and process its data in another. With callbacks it is very easy to get confused, because the code turns into spaghetti made of them; this situation is described by the term “callback hell”. There are several options in C++, and none of them is good.

One is the approach taken from the C language, where each callback is a function pointer, and we register callbacks to handle each event that occurs. Quite a few C libraries follow this ideology, and the approach works. Its disadvantage is the inconvenience and verbosity of programming this way.
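
A sketch of this style (async_read and struct connection here are hypothetical, not any particular library):

#include <cstddef>

struct connection { /* per-client state */ };

// C-style callbacks: the library calls us back through a function
// pointer, and a void* carries the per-connection state.
typedef void (*read_cb)(int fd, const char* data, std::size_t len, void* user_data);

void async_read(int fd, read_cb on_read, void* user_data);   // hypothetical library call

void on_request(int fd, const char* data, std::size_t len, void* user_data) {
    auto* conn = static_cast<connection*>(user_data);        // recover our state by hand
    // parse the request, start the next asynchronous operation, ...
}

// somewhere in setup:
// async_read(fd, on_request, conn);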

The second approach uses C++ language facilities: interfaces, that is, classes with pure virtual methods with names like OnData(). The programmer must write his own class implementing such an interface and pass a pointer to it to the asynchronous function. This approach seems less common than the first, but it works. Its disadvantages are the same: verbosity and inconvenient programming.
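
A sketch of the interface style (the IReadHandler interface and the library that calls it are hypothetical):

#include <cstddef>

// Interface-style callbacks: the asynchronous library notifies a
// user-supplied handler object through virtual methods.
struct IReadHandler {
    virtual void OnData(const char* data, std::size_t len) = 0;
    virtual void OnError(int code) = 0;
    virtual ~IReadHandler() = default;
};

class RequestHandler : public IReadHandler {
    void OnData(const char* data, std::size_t len) override {
        // parse the request, start the next asynchronous operation, ...
    }
    void OnError(int code) override {
        // log and close the connection
    }
};

// somewhere in setup: async_read(socket, &my_request_handler);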

The third approach is similar to the second, except that instead of abstract interfaces it uses a newer feature of the C++ language: lambdas. Lambdas are, in essence, syntactic sugar, but they do reduce the verbosity.
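
With lambdas the same registration shrinks to something like this (again a sketch, over a hypothetical async_read that accepts any callable):

// Lambda-style callbacks: less boilerplate, but the continuation still
// lives far from the point where the operation was started.
async_read(socket, [&db_client](const char* data, std::size_t len) {
    auto result = db_client.do_query(prepare_request({data, len}));
    // ... and the write() callback goes here, one nesting level deeper
});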

The main problem with all three approaches to implementing asynchrony is the logical complexity of the code: the programmer must mentally track the sequence of callback invocations, the lifetime of every variable needed to process the I/O results, and the lifetime of the variables captured in lambdas.

The fundamental point is that local variables on the stack generally cannot be used to hold the results of the current I/O operation. For example, we start a read operation on a socket and pass it a buffer to store the results. But since the asynchronous read() returns immediately, the result of the operation becomes available somewhere else, where the local variables of the current scope are out of reach, so we have to get creative. Lambdas do not let you forget about the lifetime of captured objects either; they demand gray matter from the programmer, and brain activity, as is known, is the most energy-consuming activity in human life.
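
A typical workaround, sketched below, moves the buffer to the heap and lets the lambda capture keep it alive (Socket::async_read and process are hypothetical here, though the pattern itself is common in Asio-style code):

#include <memory>
#include <vector>

// The stack frame of start_read() dies immediately, so a stack buffer
// would be gone by the time the data arrives. A shared_ptr captured by
// value keeps the heap buffer alive until the callback has run.
void start_read(Socket& socket) {
    auto buffer = std::make_shared<std::vector<char>>(4096);
    socket.async_read(buffer->data(), buffer->size(),
        [buffer](std::size_t bytes_read) {   // capture by value extends the lifetime
            process(buffer->data(), bytes_read);
        });                                  // start_read() has long since returned
}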

All of the above makes asynchronous programming much more difficult than synchronous programming and demands a more highly qualified programmer. Programming productivity also drops, if you measure it as the amount of implemented functionality per unit of working time.

Salvation in coroutines

The newest approach to asynchronous programming is the use of so-called coroutines. Their proponents claim that coroutines are almost a silver bullet for the problems of asynchronous programming: they combine the ease of programming of the synchronous approach with the high performance of the asynchronous one.

But for this you need coroutine support in the programming language. C++, starting with the C++20 standard, has basic support for coroutines. You also need to know how to apply them in your code.

New keywords were introduced for coroutines in C++: co_return, co_yield, co_await. I will not go into the details of coroutine syntax here; you can find them in other literature. In short, a coroutine is any function (well, almost any) that uses one of the keywords co_return, co_yield, or co_await. The network service code above, written using coroutines, could look like this (pseudocode):

auto request = co_await socket.read(...);
auto result = co_await db_client.do_query(prepare_request(request));
co_await socket.write(prepare_result(result));
socket.close();

The compiler arranges things so that this function can remember its state, suspend its execution in favor of other useful work, and then, when an event occurs, for example, the completion of a read from the socket, restore its state and continue.

The compiler generates a state machine inside our function, so that re-entering the function jumps directly to the line from which work must resume: for example, the data has already been read from the socket, and we must continue from the point where the database query is prepared. Historically, such switch-based state machines are associated with Duff's Device.
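
By hand, such a state machine could look roughly like this (a sketch; Socket, DbClient, and the split start_/finish_ operations are hypothetical):

// What the compiler effectively generates: the "locals" move into a
// frame object, and a switch on the saved state jumps straight to the
// resumption point.
struct HandlerFrame {
    int state = 0;
    Request request;   // survives between invocations, unlike stack locals
    Result  result;
};

// Called again on every I/O completion; returns true when finished.
bool resume(HandlerFrame& f, Socket& socket, DbClient& db_client) {
    switch (f.state) {
    case 0:
        socket.start_read();                   // kick off the read, don't wait
        f.state = 1;
        return false;                          // "suspend": just leave the function
    case 1:
        f.request = socket.finish_read();
        db_client.start_query(prepare_request(f.request));
        f.state = 2;
        return false;                          // suspend again
    case 2:
        f.result = db_client.finish_query();
        socket.write(prepare_result(f.result));
        socket.close();
        return true;                           // done
    }
    return true;
}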

Thus, there are as many resumption points as there are uses of the co_await keyword. At each co_await we remember the current state of the function and leave it; some useful work is done until the data from the asynchronous operation is ready; then we re-enter the function and jump to the very state in which we left it, to continue its work.

Thus, on one thread we execute several functions at once, saving on switching between threads. Such functions, several of which execute on a single thread, are called coroutines. Note that coroutines do not run in parallel but take turns within one thread: first one coroutine runs up to a co_await, then another runs up to its co_await, and so on. Sequentially.

The main problem with coroutines lies in the variables on the stack. Every time we exit the function, the stack is unwound and the variables on it are destroyed! On entering the function the second and subsequent times, the variables on the stack must be restored. How this is done technically determines the type of coroutine.

There are several ways to implement coroutines: the so-called stackful and stackless coroutines. The stack is an area of memory pointed to by a processor register, usually called SP (stack pointer), and it is usually quite limited in size; for example, the default on Linux is typically 8 megabytes per thread. Stackful coroutines simply remember the current value of the SP register, point it at a new memory area, and transfer control to another coroutine, making sure the old stack area stays intact. When the old coroutine is resumed, the value of the stack pointer register is simply restored. This is a fairly low-level approach that requires knowledge of the specifics of the target architecture.
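
Here is a minimal compilable illustration using the POSIX ucontext API (available in glibc, though marked obsolescent); swapcontext() is precisely that save-and-restore of SP, IP, and the other registers:

#include <ucontext.h>
#include <iostream>

static ucontext_t main_ctx, coro_ctx;
static alignas(16) char coro_stack[64 * 1024];  // the coroutine's private stack

static void coro_body() {
    std::cout << "coroutine: step 1\n";
    swapcontext(&coro_ctx, &main_ctx);          // "yield": switch stacks back to main
    std::cout << "coroutine: step 2\n";
}                                               // returning jumps to uc_link (main_ctx)

int main() {
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp   = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link          = &main_ctx;      // where to go when the body returns
    makecontext(&coro_ctx, coro_body, 0);

    swapcontext(&main_ctx, &coro_ctx);          // run the coroutine until its first yield
    std::cout << "main: between steps\n";
    swapcontext(&main_ctx, &coro_ctx);          // resume it from where it left off
    std::cout << "main: done\n";
}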

Stackless coroutines don't actually use the stack for their local variables. The compiler analyzes the body of the coroutine, determines the memory needed for all of its local variables, and allocates a block of the required size on the heap. It then saves and restores a pointer to this block of local variables, called the coroutine frame.

There are certain rules for creating your own objects that can appear to the right of the co_await keyword in C++. Such Awaitable objects must provide three methods: await_ready, await_suspend, and await_resume, and there are nuances in implementing each of them. I will not go into those details here. I will only note that using the co_await and co_return keywords, creating your own Awaitable objects, and implementing the await_ready, await_suspend, and await_resume methods all require effort from the programmer.
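
For illustration, here is a minimal compilable Awaitable with these three methods; the Task type is the bare minimum needed for the coroutine to compile, and the "I/O completion" is simulated by main():

#include <coroutine>
#include <iostream>

// Minimal coroutine return type so the example compiles; real code
// would use a full task<T> from a coroutine library.
struct Task {
    struct promise_type {
        Task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

// A toy Awaitable: always suspends the coroutine and stores its handle
// so that the "I/O completion" can resume it later.
struct ReadAwaitable {
    std::coroutine_handle<>* slot;
    bool await_ready() noexcept { return false; }            // never ready: suspend
    void await_suspend(std::coroutine_handle<> h) noexcept { *slot = h; }
    int await_resume() noexcept { return 42; }               // the "read" result
};

std::coroutine_handle<> pending;

Task reader() {
    int n = co_await ReadAwaitable{&pending};  // suspends here, control returns to main
    std::cout << "read " << n << " bytes\n";   // runs only after resume()
}

int main() {
    reader();          // runs up to the co_await, then falls back to main
    pending.resume();  // simulate the I/O completion: resume the coroutine
}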

The essence of coroutines

In the synchronous case, we have many threads; the operating system, through its preemptive multitasking mechanism, performs context switches, in essence scheduling the threads across the processor cores. Yes, a context switch is an expensive operation, since it requires entering kernel mode. An algorithm dynamically calculates the minimum slice of processor time allocated to a thread, depending on its priority, and the part of the operating system called the thread scheduler implements this algorithm. On multi-core processors some threads actually run in parallel, and others run sequentially, via successive context switches.

In the case of coroutines, several coroutines can execute on one operating system thread. Switching between them is quite fast, because we do not need to enter the protected mode of the operating system kernel: everything happens in user space. Coroutines execute sequentially, using so-called cooperative multitasking, where the currently running coroutine itself decides when it can be suspended and control passed to the next coroutine. Does this remind you of anything?

In fact, coroutines are an attempt to move a part of the operating system, the thread scheduler, into user space. Technically, instead of threads we have coroutines, executed in a “multitasking” mode; whether it is preemptive or cooperative is not so important right now. There is also a coroutine scheduler, the analogue of the thread scheduler. All those manipulations with the SP register in stackful coroutines are, in essence, a minimal implementation of the processor's context-switching mechanism. And the Duff's-Device-style state machine is, in effect, an emulation of saving and restoring the IP register (instruction pointer) that happens during a processor context switch.

Avoiding stack memory in stackless coroutines leads to extra fragmentation of heap memory and slower allocation, since allocation on the real stack is a very fast operation. And all of this for the sake of simplifying coroutine switching, to ease the implementation of the “multitasking” manager.

Philosophical question

Why, when trying to move the multitasking scheduler from kernel space to user space, are we forced to resort to all sorts of contortions and tricks, introducing new language facilities and keywords, changing our approach to programming and even the programming philosophy itself?

To put the question another way: why should some low-level detail, like the scheduler or the OS thread manager, force us to change our programming system and approach in general? Maybe something is wrong here?

Taking the idea to the point of absurdity: why, when porting our program from Windows to Linux, which, say, have different thread manager implementations, do we barely change the program's source code? By that logic, we would have to change the very approach to programming, say, switch from an object-oriented style to functional programming and rewrite the program completely. This is, of course, sarcasm.

After Coroutines

I'll make a bold proposal. What could be the next stage of asynchronous programming, after coroutines?

Synchrony! Let's throw out all the Awaitables, co_await, lambdas, callback interfaces, and function pointers for callbacks. These are all superfluous things that complicate programming. We should concentrate on the application domain, not on tasks like organizing some kind of multitasking; the operating system or the language should provide that.

Say we have a ready-made, well-debugged program written in a synchronous, multi-threaded style. We could simply recompile the program to work with a multitasking manager in user space, and it would become “asynchronous” under the hood.

An implementation of the POSIX Threads library could create so-called “coroutines” instead of real OS threads and execute our program on a fixed number of threads matching the number of available processor cores. The entire inter-thread synchronization infrastructure would need to be reworked to operate in coroutine mode under the hood. It is no accident that some companies are already building their own mutexes tailored to coroutines, for example, Yandex in its userver. In general, it turns out that all the inter-thread synchronization primitives, condition variables, mutexes, semaphores, have to be re-implemented for coroutines. I have a strong feeling we are heading somewhere wrong. Why not keep the same synchronization primitives from the std namespace, with a different implementation under the hood, instead of rolling our own?
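
For illustration, here is a minimal sketch of such a coroutine-aware mutex (hypothetical, loosely in the spirit of what userver provides, not its actual API): lock() is awaited instead of blocking the OS thread, and unlock() resumes the next queued coroutine. Usage would be co_await mtx.lock(); ... mtx.unlock();

#include <atomic>
#include <coroutine>
#include <deque>
#include <mutex>

class CoMutex {
public:
    struct LockAwaiter {
        CoMutex& m;
        bool await_ready() noexcept {
            return !m.locked_.exchange(true);            // try to take the lock without suspending
        }
        bool await_suspend(std::coroutine_handle<> h) {
            std::lock_guard g(m.guard_);
            if (!m.locked_.exchange(true)) return false; // freed meanwhile: resume immediately
            m.waiters_.push_back(h);                     // park this coroutine; the thread stays free
            return true;
        }
        void await_resume() noexcept {}
    };

    LockAwaiter lock() { return LockAwaiter{*this}; }

    void unlock() {
        std::coroutine_handle<> next;
        {
            std::lock_guard g(guard_);
            if (waiters_.empty()) { locked_.store(false); return; }
            next = waiters_.front();
            waiters_.pop_front();                        // ownership is handed to the next waiter
        }
        next.resume();                                   // continue it on the current thread
    }

private:
    std::atomic<bool> locked_{false};
    std::mutex guard_;                                   // protects only the waiter queue
    std::deque<std::coroutine_handle<>> waiters_;
};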

A separate issue is the generation and placement of co_await under the hood for such a compilation target. In preemptive multitasking mode, we do not think or worry about when our program might be suspended and the processor handed to another program. Why should it be any different in a cooperative multitasking mode implemented with compiler support?

Let the compiler place co_await itself. Questions arise about the algorithm and heuristics for placing co_await; I hope a solution can be found for this technical issue as well. The co_await points could also serve as cancellation points for cancelling the execution of the current “virtual thread” (which is actually implemented with a coroutine). The application programmer would not even know that his current thread is not a real one but is based on a coroutine.

Write synchronous code, it's simple and clear! And under the hood, in the future, there may be cooperative multitasking with asynchronous machinery. That is the radical solution to the problem of asynchronous programming.
