Weak memory models: write buffering on x86

About the author

Anton Podkopaev is a postdoc at MPI-SWS, head of the weak memory models group at the JetBrains Research language tooling lab, and a lecturer at the Computer Science Center.

Back in 1979, in the article “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, Leslie Lamport introduced, as the title suggests, an idealized semantics of multithreading – the sequential consistency model (SC). According to this model, any result of executing a multithreaded program can be obtained as a sequential execution of some interleaving of the instructions of the program’s threads. (An interleaving is required to preserve the order of instructions within each thread.)

Consider the following program SB (its two threads are shown side by side; x and y are shared locations, a and b are thread-local registers):

  [x] := 1     ||     [y] := 1
  a := [y]     ||     b := [x]
This program has two threads; in each of them, the first instruction writes to a shared location (x or y) and the second reads from the other shared location. There are six interleavings of the threads’ instructions for this program:

  1: [y] := 1;  b := [x];  [x] := 1;  a := [y]
  2: [x] := 1;  a := [y];  [y] := 1;  b := [x]
  3: [x] := 1;  [y] := 1;  a := [y];  b := [x]
  4: [x] := 1;  [y] := 1;  b := [x];  a := [y]
  5: [y] := 1;  [x] := 1;  a := [y];  b := [x]
  6: [y] := 1;  [x] := 1;  b := [x];  a := [y]
In this and the following examples, we assume that the shared locations are initialized to 0. Then, according to the SC model, this program has only three possible execution outcomes: [a=1, b=0], [a=0, b=1], and [a=1, b=1].
The first of them corresponds to sequential execution of the first interleaving, the second to the second, and the third to the remaining four. No other outcome of the program SB, in particular [a=0, b=0], is allowed by the SC model.

The latter guarantees the correctness of the simplified version of Dekker’s mutual exclusion algorithm given below, i.e. there is no execution of this program in which both critical sections are executed:

  [x] := 1                    ||  [y] := 1
  if [y] == 0 then            ||  if [x] == 0 then
    (* critical section *)    ||    (* critical section *)
Let’s check whether practice agrees. To do so, we implement the program
SB-impl in C++:

 1: #include <thread>
 2: #include <iostream>
 3: 
 4: using namespace std;
 5: 
 6: int x, y, a, b;
 7: 
 8: void thread1() {
 9:  cout.flush();
10:  x = 1;
11:  a = y;
12: }
13: 
14: void thread2() {
15:  y = 1;
16:  b = x;
17: }
18: 
19: int main() {
20:  int cnt = 0;
21:  do {
22:    a = 0; b = 0;
23:    x = 0; y = 0;
24: 
25:    thread first(thread1);
26:    thread second(thread2);
27: 
28:    first.join();
29:    second.join();
30:    cnt++;
31:  } while (a != 0 || b != 0);
32:  cout << cnt << endl;
33:  return 0;
34: }

Here, an analogue of SB runs in a loop (lines 21–31) until the result [a=0, b=0] is obtained. Once it is, the number of the corresponding iteration is printed to the screen (line 32). (Why thread1 needs the cout.flush() call on line 9 is explained below.)
From the point of view of the SC model, this program should never terminate. However, if you compile this code with GCC and run it on an x86 machine,

g++ -O2 -pthread sb.cpp -o sb && ./sb

you may see on the screen, for example, 23022 or 56283 – results the author obtained by running this code on his own computer.
This example shows that the idealized SC model does not describe the real state of affairs. Why is that? Where does the program SB get the result [a=0, b=0]? There are in fact two causes: compiler optimizations and processor optimizations. The compiler – GCC, in our case – can notice that the two instructions in the left thread of SB (and likewise in the right one) are independent of each other and may therefore reorder them:

  a := [y]     ||     b := [x]
  [x] := 1     ||     [y] := 1
For the reordered program, [a=0, b=0] is allowed even from the point of view of the SC model. And this reordering really happens! To verify it, look at the assembly code that GCC generates for the function thread1:

g++ -O2 -c -S sb.cpp

1: ...
2: call    _ZNSo5flushEv@PLT
3: movl    y(%rip), %eax
4: movl    $1, x(%rip)
5: movl    %eax, a(%rip)
6: ...

On line 3, the value of the variable y is read into the register eax; on line 4, the value 1 is written to the variable x; and on line 5, the value from eax is written to the variable a. In other words, in the generated code the read of y precedes the write to x.

Why call cout.flush()

Now it is time to explain why the call to cout.flush() is needed. As mentioned earlier, since the write to x and the read from y are independent, the compiler – GCC, in our case – may decide to reorder them. However, it is not obliged to: for example, GCC performs no such reordering in the function thread2. For the compiler to reorder, it has to expect the reordered code to be more efficient, and, as practice shows, the call to cout.flush() makes GCC decide the instructions should be swapped. It does not have to be this particular function – any call that GCC will not eliminate as useless suffices. So printing a non-empty string with cout << " " works, while calling arithmetic functions such as sqrt() or abs() without using the result does not.

Note that cout.flush() does nothing in the program SB-impl, although GCC cannot deduce this on its own: cout.flush() flushes the output buffer to the console, yet on every call the buffer is empty, because the program writes to the console only at the end (line 32).

There is a way to explicitly prevent the compiler from reordering the read and write instructions in the functions thread1 and thread2: insert the compiler memory barrier asm volatile("" ::: "memory") between the instructions (with the barrier in place, adding a cout.flush() call changes nothing):

void thread1() {
  x = 1;
  asm volatile("" ::: "memory");
  a = y;
}

void thread2() {
  y = 1;
  asm volatile("" ::: "memory");
  b = x;
}

However, if you run the program with the compiler barriers, you can still get the result [a=0, b=0]: as mentioned earlier, not only the compiler but also the processor can produce results that go beyond the SC model. The outcome [a=0, b=0] for the program SB is not only permitted by the x86 architecture specification but is also observable on most x86 processors.

To understand how such a result arises on the processor, one needs to turn to the architecture’s memory model, i.e. its formal semantics. The x86 memory model is called TSO (total store order). TSO allows executions of programs outside the SC model, in particular an execution of the program SB ending with the result [a=0, b=0]. Such executions are called weak, as are the memory models that admit them. All major processor architectures (x86, ARM, RISC-V, POWER, SPARC) have weak memory models.

The TSO model can be schematically depicted as follows:

  Thread 1 ──► write buffer 1 ──┐
                                ├──► main memory
  Thread 2 ──► write buffer 2 ──┘

Under this model, threads do not update main memory directly when they execute a write instruction. Instead, a thread places a request into its local write buffer, from which the request reaches main memory some time later. On a read, the thread first checks its own write buffer for requests to the corresponding location and, only if there are none, accesses main memory.

Consider an execution of the program SB in the TSO model that ends with the result [a=0, b=0]. At the start of the execution, the program counters of both threads point to their first instructions, all locations in memory hold 0, and the registers a and b are undefined. The execution then proceeds as follows:

1. The left thread executes the write instruction to the location x, and the corresponding request goes into its write buffer.
2. The right thread performs a similar step: its write to y is buffered as well.
3. The right thread executes the read instruction from the location x. Since its buffer contains no write request to this location, the thread takes the value from main memory, assigning the value 0 to the register b.
4. Similarly, the left thread assigns the value 0 to the register a.
5. The request from the left thread’s buffer reaches main memory, setting [x] to 1.
6. Finally, the request from the right thread’s buffer reaches memory, setting [y] to 1.
So, in the TSO model, we obtain a weak execution of the program SB. Incidentally, the name SB is an abbreviation of store buffering – the effect observed in its weak execution.

However, there is a way to forbid the result [a=0, b=0] for the program SB-impl, and hence to implement Dekker’s algorithm on the x86 architecture. To do so, one adds the processor memory barrier mfence – a special x86 instruction – to the program. Like the compiler barrier used earlier, it prevents GCC from reordering instructions around it, but in addition it requires, at execution time on the processor, that the write buffer of the corresponding thread be drained before execution continues:

void thread1() {
  x = 1;
  asm volatile("mfence" ::: "memory");
  a = y;
}

void thread2() {
  y = 1;
  asm volatile("mfence" ::: "memory");
  b = x;
}

With the barriers in place, the read a := [y] in the left thread of the corrected SB-impl can execute only after the write [x] := 1 has updated main memory. The same holds for the read b := [x] in the right thread. As a result, the outcome [a=0, b=0] becomes impossible.

Conclusion

We have considered an example of a multithreaded program that can execute on the x86 architecture in an unexpected way, demonstrating a so-called weak execution that goes beyond the familiar SC model. This execution is the result of both compiler optimizations and optimizations employed in the x86 processor family.

If the topic of weak executions and memory models arouses interest, the author plans to write a series of posts on it.
