How, wandering through Stack Overflow, you can come across the Branch predictor

On the blog beeline cloud we shared stories and opinions of developers – as a self-taught programmer learned 30 programming languages, in which cases pair programming doesn't work And Why do some projects fade away?when the development team lead leaves the company. Today we’ll talk about how to learn the basics of computer science using Stack Overflow, even if your education is not related to computer technology. Here's an interesting translation.

My main education is far from IT. But around 2016, I figured out how to learn the basics of computer science using Stack Overflow. This is how I got an exciting hobby. In my free time, I browse the site looking for questions, who received the most votes.

I described the method itself, as well as the results it brings, in my article 16-hour training for developers. What are its main advantages? The fact is that this approach is often much better than reading textbooks while studying at a university.

The top-voted answer on Stack Overflow will give you first-hand experience from an industry expert. And this is combined with a brief theoretical background (which good specialists usually provide in their answers).

I want to dedicate this article to one of the questions with the most votes on Stack Overflow (currently about 24 million upvotes!).

First meeting

When I came across this question, I couldn’t believe my eyes. At first I even decided that I had misunderstood him.

All right, I thought, another asterisk question typical of FAAMG interviews.

Nothing new. The benefits of searching a sorted array are well known. And the famous binary search has completely become the talk of the town at programming interviews.

But after looking at two small code snippets available via the link, I was seriously puzzled. The code itself (loops in C++ and Java) was quite simple. This is something I could have written in college even without the help of a programming teacher.

// !!! With this, the next loop runs faster.
   std::sort(data, data + arraySize);

   // Test
   clock_t start = clock();
   long long sum = 0;
   for (unsigned i = 0; i < 100000; ++i)
   {
       for (unsigned c = 0; c < arraySize; ++c)
       {   // Primary loop.
           if (data[c] >= 128)
               sum += data[c];
       }
   }

Therefore, I had no difficulties understanding the code. But, nevertheless, the code itself did not fit into any of the models familiar to me.

I had to read it three times, resisting the urge to scroll to find out the answer. But something still eluded my understanding. What kind of array processing is this? Iterating element by element? Very similar to regular search, and not the fastest.

Only after some time did I realize that I was seeing a completely different type of search, and certainly not binary. Each element here was processed in a special O(N) order. The loops did not depend on the programmer's mathematical assumptions, such as:

if value < maxElement/2
  search for value in (0...maxElement/2 - 1)
else
  search for value in (maxElement/2...maxElement)

What made it possible to speed up the processing of a sorted array?

Brief answer to the question

In this case, the trick is not in the program, but in the machine that executes it. The rating issue with StackOverflow is more concerned not with the software part, but with the hardware capabilities of the platform.

Best answer it was compared to a railway junction. Before you read further, remember the basics of this analogy:

Node = comparison statement (If statement)

Branch = result of executing a JUMP instruction or some variation thereof

(if-else/while control flow)

Imagine a blind switchman from the early 1800s. Normally he would have had to stop the train at a junction. Then ask the driver which line he should take next. After receiving the necessary information, he switched the arrow.

The method is extremely ineffective, given the loss of time and fuel. To optimize it, such an operator had only one choice. Turn the switch, based on your own assumptions, before the train arrives at the junction.

If he were an oracle, the train would never have to stop.
But if the prophecies misfire, the driver will have to stop moving and reverse.

However, if the movement schedules are predictable, this model will have a high chance of success (first option). Now let's look at this scenario in real programming.

When processing an array containing a comparison (if currentElement <= or == or ≥ value), the processor uses a clever trick.

Let's say the loop is executed N times. At each k-iteration (0 < k < N-1) of processing the array, it will try to “remember” what happened in previous iterations from 0 to k-1. Data usage depends on the complexity of the processor design, so let's leave that aside for now.

If a condition (eg currentElement == value) was true most of k-1 times, it is assumed to be true kn times. And vice versa.

Based on the fact that the counter values will definitely be less than 30 (based on previous ones such as 17, 18, 19…), why wait until 20 (or 21, 22 or 23)?

You just need to be guided by the fact that they are less than 30.

What exactly is Branch predictor – a module for predicting transitions (predicting branches)

And if he's so good, what's still wrong with him?

“Bottleneck” by von Neumann

The processor executes instructions. These words mean all the calculations he makes (c = a + b) and comparisons (i == 100). But this is only a small part of his work. However, this is what we will talk about now.

Note: to execute each instruction (in assembly language), the processor also solves other related tasks:

Sample: instructions and operands (a, b, c, +, = or the processor equivalent of an IF instruction) are retrieved.
Decoding: converting memory addresses of corresponding spaces as the CPU executes multiple programs in multiple contexts.
Performance: is familiar to us as a programming language operator (for example, c = a + b).
Reverse entry: changing data/program memory with the result obtained, for example filling variable C with the value A + B.

The combination of these four stages is known as a 4-stage processor instruction cycle.

However, not everything is as good as it might seem at first glance. The processor operates in clock cycles. But memory cannot provide instructions as fast as the processor can process them. The gap between processor execution speed and memory bandwidth is known as von Neumann bottleneck. The name is given in honor of the inventor of the first computer architecture, known as Princeton model.

In other words, executing one assembly instruction takes at least N (>1) processor clock cycles, where N is the length of the instruction pipeline.

Of course, this is just an extremely simplified explanation. 4-stage diagram command pipeline (N=4) is given below.

4-stage command pipeline - Wikipedia — 4-stage command pipeline − Wikipedia

This is what a 5-stage (N=5) version of the command pipeline looks like:

5-Step Instruction Cycle - Wikipedia — 5-Step Instruction Cycle − Wikipedia

Empty (or crossed out) slots indicate that the processor is free, waiting for program instructions to arrive from memory. An unused slot means wasted capacity, latency, and unnecessary waste of expensive processing power. Now multiply that by the processor clock speed and you get an idea of the performance loss.

To somehow reduce the severity of the problem, computer developers have come up with many strategies.

Almost all of them are familiar to us:

CPU cache
Cache division between data memory and program memory
System on a Chip – Apple's M1 product line is its descendant

With the exception of the branch predictor module.

How does the transition predictor work?

Branch predictor is a probabilistic strategy aimed at reducing the number of missed processor cycles due to their repetitive nature.

The module is implemented at the hardware level. It is essentially a digital circuit that is most easily implemented using a flip-flop.

The predictor memory stores the history of the choice of execution path during iteration within the for/while loop.

The more complex a branch predictor is, the more information about previous branches it can store in memory. Accordingly, the more accurately he can predict which branch should be selected for execution.

In a loop containing 100 iterations within a sorted array, the IF path is used the first 51 times and the ELSE path is used the last 49 times.

Since the array is sorted, it is not difficult for the branch predictor to cope with the task. The first 51 times, the selection of instructions for the IF path will occur according to the following scheme: the 1st time randomly, and from the 2nd to 51st times based on the accumulated history.

Each time an IF statement is executed, a TRUE result will result in the EXECUTE phase of the IF path. The branch will occur without waiting for the FETCH and DECODE phases (IF branch instructions).

This will continue until the 51st iteration, that is, until the pre-selected set of branch predictor instructions matches the evaluation of the IF operator. In other words, the branch predictor will operate on ready-made executable instructions even before the CPU knows that the IF is TRUE.

When the 52nd iteration is executed, the IF result will be FALSE. The processor will know that its preselected set of instructions needs adjustment. Accordingly, time will be spent fetching and decoding the ELSE instruction set. It will then continue working with this set (doing EXECUTE without FETCH and DECODE) until the 100th iteration.

Starting from the 52nd iteration, the processor enters a repetitive mode of operation. An analogy can be drawn with the railway situation described above. The train driver realizes the mistake of the blind switchman, returns the train to the fork and then follows the right path, not forgetting the heartfelt wishes addressed to the failed oracle.

When processing a sorted array, the penalty is not too severe given the benefit already gained. In 98 cases out of 100 (except for the 1st and 52nd iterations), the prefetching of instructions was still effective.

But this is a simple example. Cycles can be very complex. Although the developers of hardware solutions are not asleep. Advanced transition predictors are being actively developed.

For better understanding, below is a block diagram taken from a technical blog. Its logic is no different from how the cache works.

Transition Predictor Flow - The Beard Sage — Branch Predictor Flow − The Beard Sage

Complex branch predictors can also be built at the software level by placing the execution flow directly in the data structure. You can read about this in the blog at the link (under the diagram).

Conclusion

What does a high ranking of questions or answers on Stack Overflow mean? And how can we explain the enormous popularity of this online resource?

Nowadays, Chat GPT has created a serious alternative to classic forums. In these circumstances, the Stack Overflow rating ranking can be called somewhat controversial. It will still remain an online testament to the solidarity of the developer community, but not to their programming competence.

The effectiveness of knowledge acquisition is evident when related fields are discussed by great scientists and experienced industry experts. The question about faster processing of a sorted array clearly proves this. The answers to this cover not only how programs are executed, but also how platforms are designed. It becomes clear why Moore's law does not allow achieving the desired computing power, despite the increase in processor speed.

Answers to questions that deeply penetrate the essence of concepts are precisely what create its unique aura around the StackOverflow resource. No GenAI AI tool can match the stories and reasoning that volunteer experts themselves bring to the community.

And we deserve special respect for curious and thoughtful developers who, with their questions, provoke educational discussions about the intricate architecture of algorithms!