486 is enough for everyone
At the end of a technical interview, if the candidate has answered the questions and completed the tasks, we leave time for free-form questions that can be asked to the team or to any of the interviewers. I carried this practice from company to company, and it always helped defuse the situation or get a person talking if they were tense during the conversation. The questions can be about anything except personal matters or things under NDA. Usually candidates ask technical questions about the stack and pipelines; sometimes they try to ask tricky questions, especially about C++, to test us, and sometimes we cannot answer all of them. Google-style questions like “why are tablets round?” come up too, but recently at one of the interviews a question was asked that everyone seemed to know the answer to, yet no one could give it right away. It sounded like this: “What common technologies and solutions have appeared in processors since the 486 that we use all the time?”
The question really is interesting: what new things do we use every day? What can modern processors do that processors from a year or two ago, five or ten years ago, forty years ago couldn’t? We simply use billions of transistors without even knowing how they work. After digging through Wikipedia, Agner Fog's website, and Intel documentation, I compiled a list of what has appeared in modern processors and is actually used. Everything below applies primarily to x86 and consoles unless otherwise noted. Since the consoles that came after the PlayStation 3 are essentially PCs with minimal differences, the discussion will mainly focus on PCs. History tends to repeat itself, and much of what we have now was introduced more than once, just under different names.
Miscellaneous
For starters, modern processors have wider registers and can address more memory. In the 1980s, 8- and 16-bit processors were the norm; now almost every PC and console runs on a 64-bit processor. I will not dwell on this in detail, assuming the reader is at least familiar with programming or related topics.
Other things we rely on every day, but which appeared in x86 back in the early 80s, are paging and virtual memory, execution pipelines, and floating-point support via coprocessors.
Caches
https://en.wikipedia.org/wiki/CPU_cache
One of the most significant technical solutions for everyday programming was the appearance of caches in processors. For example, on the 286, memory access took only a few clock cycles, but on the Pentium 3/4, memory access required more than 400 clock cycles. Even though processors have become significantly faster, memory has not evolved as quickly.
The solution to the problem of slow memory was caches and prefetching data in large blocks from RAM. Caching allows you to quickly access frequently used data, and prefetching loads data into the cache if the access pattern is predictable.
A few cycles versus 400+ looks scary; in fact it is a drop of more than a hundred times. But if you write a simple loop that processes a large block of data, the processor will prefetch the necessary data ahead of time, letting you process it at about 22 GB/s on my 3 GHz Intel. Handling two 4-byte values per clock at ~3 GHz gives a theoretical 24 GB/s, so 22 GB/s is a good result: streaming from main memory costs us well under 10% of throughput, not orders of magnitude.
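A minimal sketch of the kind of “simple loop over a large block” meant here (the function and its signature are illustrative, not from any particular codebase): a linear pass over a big array is exactly the pattern the prefetcher recognizes.

#include <cstddef>
#include <cstdint>

// Summing a large block with a predictable, linear access pattern:
// the hardware prefetcher stays several cache lines ahead of the loop,
// so it runs at close to memory-bandwidth speed.
uint64_t sum_block(const uint32_t* data, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}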
And that's not all: with known access patterns, for example, when working with arrays and data blocks that fit well in the processor cache and are easily detected by the prediction unit, you can get close to the maximum processing speed. Here it is worth mentioning the famous article by Ulrich Drepper “What every programmer should know about memory” (https://people.freebsd.org/~lstewart/articles/cpumemory.pdf).
TLB (Translation Lookaside Buffer)
https://en.wikipedia.org/wiki/Translation_lookaside_buffer
How many caches does a core have? One, two, three, five? A core has many different caches for different tasks, and the data caches in front of main memory are not even the fastest ones. For example, there is a cache of decoded instructions, and you are unlikely to ever think about its existence unless you are doing micro-optimization after everything else has already been tried.
There is also the TLB, a cache used to translate virtual memory addresses. It sits separately because, even if the data is in the L1 cache, looking up an arbitrary address would take several clock cycles. That is why there is a dedicated cache for virtual address lookups; it usually has a very limited size (tens, at most hundreds of entries), but thanks to its own lookup unit a search takes a cycle or less.
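One practical lever here, sketched below under the assumption of a Linux machine with huge pages configured (the helper name is made up for the example), is to back large allocations with 2 MB pages so a single TLB entry covers 512 times more memory than a standard 4 KB page:

#include <cstddef>
#include <sys/mman.h>

// Ask the kernel for an anonymous mapping backed by huge pages.
// Returns nullptr if huge pages are unavailable, so callers can fall back to a normal allocation.
void* alloc_huge(size_t bytes)
{
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}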
Speculative execution
https://en.wikipedia.org/wiki/Speculative_execution
Another important innovation that changed the approach to software development and performance was hardware support for multitasking and parallel execution. Multitasking on x86 was once handled purely at the operating-system level, but in the early 2000s, with the arrival of Intel processors with Hyper-Threading Technology (HT), it became possible to run two “threads” on a single physical core.
Thanks to HT and, later, many physical cores in the processor, we were able to execute multiple programs or parts of a program physically in parallel, reducing latency. For example, in games this is used to separate logic, audio processing, physics and rendering, where each core or thread can be busy with its own task. Modern processors, especially in consoles, can contain several specialized cores, which allows for efficient load distribution among threads.
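A minimal illustration of that kind of split (not a real engine design; the three functions are hypothetical placeholders): independent subsystems get their own threads and the main thread joins them each frame.

#include <thread>

void update_logic()     { /* game logic for this frame */ }
void mix_audio()        { /* audio mixing for this frame */ }
void simulate_physics() { /* physics step for this frame */ }

void run_frame()
{
    std::thread audio(mix_audio);           // audio and physics run in parallel
    std::thread physics(simulate_physics);
    update_logic();                         // logic stays on the main thread
    audio.join();                           // wait for both before the frame is rendered
    physics.join();
}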
The PlayStation 3 used the Cell BE (Broadband Engine) as its central processor, a joint development by engineers from Sony, Toshiba and IBM. The chip has eight SPE cores, of which only seven are active; the eighth is a spare that exists to improve manufacturing yield: if one of the eight cores comes out defective, it can be disabled without scrapping the whole processor. The cores run at 3.2 GHz.
Non-cacheable memory areas
The processor does not work with RAM directly: data first goes into the cache, is pulled from there into registers, and is then processed. But that does not suit devices like video cards, which, for example, need updates written straight into a texture buffer. The solution was non-cacheable memory regions and mechanisms for working with them that avoid needless trips through the cache. One of the hallmarks of consoles has always been a media processor that can compress and decompress a video stream with most modern codecs at a decent rate, around 150-200 frames per second, as long as you keep it fed. But if we try to copy those frames through the cache, the picture gets very depressing:
size     CPU->GPU (memcpy)
------   -----------------
256      44.44 Mb/s
512      44.46 Mb/s
1024     44.49 Mb/s
2048     44.46 Mb/s
4096     44.44 Mb/s
8192     33.31 Mb/s
16384    33.31 Mb/s
If we copy a Full HD video frame, which takes 1920*1080*1.5 (NV12) = 3,110,400 bytes, then at that rate we can upload only about 13 frames per second. Consoles provide convenient mechanisms for controlling memory mapping with different caching modes, and if you use WC (the write-combining mode for uncached memory) correctly and do something like
// open a shared-memory object backed by the decoder's DMA-capable memory
wc_fd = mdec_dma_open(SHM_ANON, O_RDWR | O_CREAT, 0);
// configure the region (anonymous, lazy write-back) before mapping it
mdec_dma_ctl(wc_fd, SHMCTL_ANON | SHMCTL_LAZYWRITE, 0, pix_buffer_mem_size);
// map it into our address space; this pointer is what the decoder and GPU will share
void *pix_buffer = mmap64(NULL, pix_buffer_mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, wc_fd, 0);
// the mapping stays valid after the descriptor is closed
close(wc_fd);
the mapping obtained through wc_fd now refers to memory with different caching settings, and the results are completely different (PS5). Given that the memory is shared, there is no need to copy anything at all: the decoder places its data in memory, we take the CPU out of this chain entirely, and it only has to mark the moments when the video card may access that data:
size     VD->CPU->GPU (memcpy)   VD->GPU (WC)
------   ---------------------   -------------
256      44.44 Mb/s              2449.41 Mb/s
512      44.46 Mb/s              3817.70 Mb/s
1024     44.49 Mb/s              4065.01 Mb/s
2048     44.46 Mb/s              4354.65 Mb/s
4096     44.44 Mb/s              4692.01 Mb/s
8192     33.31 Mb/s              4691.71 Mb/s
16384    33.11 Mb/s              4694.71 Mb/s
Desktop systems have an LLC (Last Level Cache). A cache is called the LLC when it is the last one in the hierarchy: for a CPU that is usually L3, for an integrated GPU it can be L4. It connects not only the processor but other devices as well; the video card snoops writes to such memory regions and picks the data up without involving the processor.
NUMA (Non-uniform Memory Access)
The Non-Uniform Memory Access (NUMA) architecture, where memory latency and bandwidth vary depending on which memory you touch, has become so common that it is taken as the default. Why do we have NUMA systems at all? At first there was just memory, but as processors got faster relative to it, memory could no longer deliver data quickly enough, and a cache was added. The cache had to stay consistent with main memory and work an order of magnitude faster while being much smaller. It also had to carry metadata about its contents so it knew when data needed to be written back; if no write-back was required, performance improved, since that path is the fastest.
Later, memory sizes grew so large (gigabytes and terabytes of RAM) that a small cache could no longer cover all the available memory. When working with large data sets, data was constantly evicted from the cache, and the algorithm ended up running even slower than plain accesses to main memory. Second- and further-level caches appeared to cover a larger amount of RAM on the one hand, while staying fast enough to keep the first-level cache fed on the other.
The complexity of managing caches increased even more when there were several cores, and each had its own cache. Data from RAM had to remain consistent with all caches. Before NUMA, each processor/core would send a message to the shared bus on every read or write operation so that other processors/cores could update their cache as needed.
This scheme worked for 2-4 processors/cores, but as their number grew, the time to collect responses from all of them stopped fitting into a single clock cycle, and the cache microcode became too complex to scale. The solution was a shared mechanism per group of cores that tracks the cache state for every core in the group. This solves cache coherence within a group, but the groups themselves still need to be synchronized with each other.
The downside to this approach is that performance decreases when one group of cores accesses memory belonging to another group. Consoles and consumer systems (up to 64 cores) additionally use a ring bus, which adds latency to the transfer of messages between groups when accessing memory.
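The practical consequence for software is to keep a thread's data on the memory node it runs on. A minimal sketch of that, assuming Linux with libnuma installed (link with -lnuma); the function name is illustrative:

#include <cstddef>
#include <cstdlib>
#include <numa.h>

// Allocate memory on a specific NUMA node so the cores of that group
// avoid the cross-group penalty; fall back to malloc on non-NUMA machines.
void* alloc_local_to_node(size_t bytes, int node)
{
    if (numa_available() < 0)
        return std::malloc(bytes);
    return numa_alloc_onnode(bytes, node);   // release later with numa_free(ptr, bytes)
}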
A bit more on cache poisoning and cache coherence here: https://habr.com/ru/articles/687146/
SIMD (Single Instruction Multiple Data)
SIMD technology, first introduced in x86 as MMX in the 90s, made it possible to execute one instruction over multiple pieces of data at once, which significantly speeds up work on arrays, vector math and graphics. SSE (Streaming SIMD Extensions) and AVX/AVX2/AVX-512 (Advanced Vector Extensions) are the more modern SIMD generations, actively used in games, performance-critical applications, and image and video processing. These instructions allow computation over large data sets, accelerating floating-point and integer operations in graphics, physics, and machine learning.
Code along these lines:
for (int i = 0; i < n; ++i) {
sum += a[i];
}
is recognized by the compiler and vectorized well. But something like this already causes problems, and the compiler often gives up even on simple examples:
for (int i = 0; i < n; ++i) {
if (!a[i]) continue;
sum += a[i];
}
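In cases like this the condition can usually be rewritten as a mask and vectorized by hand. Below is a minimal sketch, assuming AVX2 and 32-bit integer data whose partial sums fit in 32-bit lanes; the function name is illustrative and this is not what any particular compiler emits:

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

int64_t sum_nonzero_avx2(const int32_t* a, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v    = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i zero = _mm256_cmpeq_epi32(v, _mm256_setzero_si256()); // lanes where a[i] == 0
        __m256i kept = _mm256_andnot_si256(zero, v);                  // zero out the "skipped" lanes
        acc = _mm256_add_epi32(acc, kept);                            // the branch became a mask
    }
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int64_t sum = 0;
    for (int k = 0; k < 8; ++k) sum += lanes[k];                      // horizontal reduction
    for (; i < n; ++i) if (a[i]) sum += a[i];                         // scalar tail
    return sum;
}

For integers the skipped elements contribute nothing to the sum anyway, which is exactly why the branch is redundant for the result yet still gets in the way of the auto-vectorizer.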
The speed of SIMD comes from the fact that processing, say, four numbers at once does not cost extra cycles per number; the processor performs the operation on four (or even eight) values in one go. This can raise performance 2-4x on ordinary tasks and by an order of magnitude on specialized ones, though in practice we usually end up with a 1.5-2x gain on real workloads and a dedicated person on the team who maintains this code.
Out-of-Order Execution
The 1990s also introduced an out-of-order execution mechanism that allows the processor to “guess” which instructions can be executed while others are waiting for their data. This allows the processor to not be idle waiting for data to be loaded from memory, but to continue working, performing other operations for which data is already available.
Modern processors can keep dozens of such “in-flight” operations going simultaneously. This significantly improves performance where consecutive instructions would otherwise stall on each other, especially in long computational chains. In games, where algorithms such as physics interactions and particle effects are deliberately restructured to reduce dependencies, this brings a measurable gain and shaves off latency.
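A small sketch of what “restructuring to reduce dependencies” can mean in practice (illustrative code, not from a real engine): splitting one long dependency chain into four independent accumulators lets the out-of-order core keep several additions in flight at once.

#include <cstddef>
#include <cstdint>

uint64_t sum_four_accumulators(const uint32_t* a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];              // the four sums are independent,
        s1 += a[i + 1];              // so the core can execute them out of order
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // leftover tail
    return s0 + s1 + s2 + s3;
}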
Power management
The picture above, one of the studio's internal benchmarks for judging whether a CPU is suitable, lets us estimate how much performance we lose to power management during intensive computation and to the need to “wake up” the processor after it has been idle.
Modern processors are tuned for “peak” performance bursts and ship with a whole set of power-management features that cut consumption in different scenarios. Most modern applications need peak performance for a few seconds, finish a task, and then drop to a low frequency to save power. This makes the speed-to-idle approach (run tasks as quickly as possible, then let the processor idle) the most efficient one.
However, this approach does not work for games: game developers have spent years trying to load processors as much as possible and keep them at high performance for as long as possible, which keeps the cores from dropping into power-saving modes. The problem is that climbing back up to operating frequency takes noticeable time, which shows up as frame hitches and unstable response times. In some cases you even have to write code that runs in the background purely to keep the cores from entering power-saving states while the main threads have nothing to do. This is especially noticeable on the latest generations of energy-efficient processors. On the left is a mobile Ryzen 9 4900HS/RTX 2060 on our internal tests with the power-management option disabled in the BIOS; on the right, the same laptop with the active power plan. The rattling on the right comes from the processor frequency fluctuating every 4-5 frames; the benchmark is run N times, a couple of seconds each, and we record the maximum frame time of each run.
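A minimal sketch of that background “keep the core warm” trick (illustrative only; the burst and sleep sizes are arbitrary): a thread that periodically does a little dummy work so the core never stays idle long enough to drop into a deep power-saving state.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

std::atomic<bool> g_keep_awake{true};

void keep_core_warm()
{
    volatile uint64_t sink = 0;                 // volatile so the busy work is not optimized away
    while (g_keep_awake.load(std::memory_order_relaxed)) {
        for (int i = 0; i < 10000; ++i)
            sink += i;                          // a short burst of busy work
        std::this_thread::sleep_for(std::chrono::microseconds(500)); // brief pause, not long enough to park the core
    }
}

// Usage: start std::thread(keep_core_warm) at load time and set g_keep_awake to false on shutdown.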
GPGPU
Until the mid-2000s, graphics processing units (GPUs) were hidden behind APIs that offered only basic control over the hardware. But when the system effectively has a second processor with its own memory, often not inferior to the main one, the question arises: why is it sitting idle? So people started using video cards for a wider range of tasks, such as linear algebra, which maps very well onto parallel hardware. The parallel GPU architecture could chew through large matrices, 512×512 and bigger, which a conventional CPU handled very poorly, to put it mildly. Among the interesting reads on the topic is Igor Ostrovsky's article (https://igoro.com/archive/gallery-of-processor-cache-effects/).
Initially such libraries went through the traditional graphics APIs, but Nvidia and ATI noticed the trend and released extensions that gave access to more of the hardware. At the top level, a GPU consists of one or more streaming multiprocessors (SMs), each of which usually contains several compute units (cores). Unlike CPUs, GPUs lack a number of features such as large caches and branch prediction, or have them in a heavily cut-down form. So the tasks that suit GPUs well are highly parallel ones whose data can be shared across a large number of threads.
GPU memory is divided into two main types: global and shared. Global memory is the GDDR advertised on the GPU box, typically 2 GB or more; it is available to all threads on all SMs and is the slowest memory on the card. Shared memory is shared by all threads within one SM; it is faster than global memory but not visible to threads in other SMs.

Both kinds of memory demand strict access discipline; violating it costs a lot of performance. To achieve high throughput, memory accesses must be organized so that threads in the same group use them in parallel. Just as a CPU reads a whole cache line at a time, a GPU has a cache line sized to serve all threads of a group when accesses are properly aligned. Manufacturers keep growing the cache-line size to feed as many threads at once as possible, but this only works when all threads of a group hit the same line. In the worst case, when each thread in a group touches a different cache line, every thread needs its own read, and effective memory throughput collapses because most of each fetched line goes unused.
Why did I include this in a list about CPUs? Since we started using GPUs for general-purpose problems, why not look at them in the same context?
Virtual memory
This is another of those “new” features that we don't have to think about unless we are writing an operating system. Virtual memory is much easier to use than segmented memory, but that comparison is no longer relevant, so let's leave it at that. Roughly speaking, virtual memory uses both hardware and software to compensate for a shortage of physical memory by temporarily moving data from RAM to disk. Mapping chunks of memory to files on disk lets the computer treat secondary storage as if it were main memory. A good article to read: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/9_VirtualMemory.html
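The same mapping machinery is available to applications directly. A small sketch of the “map a file and treat it as memory” idea, assuming a POSIX system; the helper name is made up for the example:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only; pages are pulled in from disk on demand.
const char* map_file_readonly(const char* path, size_t* out_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                      // the mapping survives closing the descriptor
    if (p == MAP_FAILED) return nullptr;
    *out_size = static_cast<size_t>(st.st_size);
    return static_cast<const char*>(p);
}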
SMT/HT
Its use is mostly transparent to programmers, but a couple of things are worth knowing. The typical speedup from enabling HT on a core is about 25% for general-purpose programs. It improves overall throughput by letting the core pull in work from two threads of execution, but each individual thread may lose performance. For games, where single-thread performance matters most, it is often better to disable HT so that third-party applications are less likely to land in the cache and get in the way. Even if we pin a thread to a specific core, we still cannot use more than about 80% of its time: at loads around 75-80% we already see switches to other threads within a frame. As a real-life example, a modeler's game dropped to 50 fps simply because Houdini was running in the background; it was minimized and doing “nothing”, yet kept actively using the first core, where the game's main thread lived, and the problem reproduced on Intel's 12700 line. Disabling HT improved things slightly, to 55 fps, and PIX showed that Houdini execution blocks were still sneaking onto the first core. A stable 60 came only after the 3D package was closed. A lot depends on the specific load and hardware, though: on AMD there was no such dependence.
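For reference, a minimal sketch of pinning a thread to one core, assuming the Win32 API (on other platforms the analogue would be pthread_setaffinity_np or a console SDK call); the helper name is illustrative:

#include <windows.h>

// Restrict the calling thread to a single logical core.
bool pin_current_thread_to_core(unsigned core_index)
{
    DWORD_PTR mask = 1ull << core_index;                       // one bit per logical core
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}

// Usage: pin_current_thread_to_core(0); // keep the main thread on core 0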
Another side effect of chip complexity is that performance has become less predictable than before, especially on mobile chips. A good NASA paper on why a 30% gain from HT is already a good outcome, rather than the 2x everyone hopes for 🙂 (https://www.nas.nasa.gov/assets/nas/pdf/papers/saini_s_impact_hyper_threading_2011.pdf)
BPU (Branch Prediction Unit)
A branch prediction unit is a separate part of the processor, a miniprocessor with its own firmware, memory, caches, etc., which predicts the execution of conditional branches in the instruction stream. In modern processors, branch prediction has become critical to performance gains by reducing idle time in the instruction pipeline and using its resources more efficiently. In the latest generations of Intel processors, the on-chip BPU area can occupy up to 15% of the core, and the volume of BPU microcode is comparable to the volume of all other microcode. However, creating large and deep BPU blocks has become an extremely difficult task, and therefore blocks with a depth of more than 64 nesting levels are practically not designed.
The BPU itself consists of several parts:
Branch Prediction Table (BPT) — stores a history of predictions for various branch instructions, recording the instruction address and the branch status (for example, “will be executed” or “will not be executed”).
Branch History Buffer (BHB) — saves the sequence of the last transitions to improve the accuracy of predictions, taking into account the behavior of the program in the past to more accurately predict current transitions.
Branch Target Buffer (BTB) cache — stores the target addresses of transitions, allowing you to quickly move to the desired point in the code if the prediction is successful. BTB also helps load the L1 cache with data that may be needed in this branch.
Cache of last completed branches (Last Recent Branch Buffer) — stores the last N transition addresses (usually no more than 64). The lookup is first performed in this fast cache; if no answer is found, a BPT search is performed.
BPU (Branches)
Interesting article on the topic (https://blog.cloudflare.com/id-id/branch-predictor/)
From my first day in programming, experienced colleagues warned me that branches slow down code execution and should be avoided. On the Intel 12700 series processors, the penalty for branch misprediction is 14 cycles. The frequency of incorrect predictions depends on the complexity of the code and the overall depth of nesting. According to my latest measurements on projects (PS4/5), the frequency of prediction errors was from 1% to 6%. Indicators above 3% are considered significant and may be a signal for code optimization.
However, if a correct branch takes only 1 clock cycle, then the average cost of a branch will be:
At 0.5% mispredictions: 0.995 × 1 + 0.005 × 14 = 1.065 cycles.
At 4% mispredictions: 0.96 × 1 + 0.04 × 14 = 1.52 cycles.
After such measurements on real projects, we stopped nitpicking over the mere presence of ifs, as long as they don't make the code harder to read.
While researching material for an internal studio talk on branching, I came to the conclusion that this penalty is not that critical, given that on average only about 5% of predictions are wrong. Unpredictable branches are a minus, but most of them show up well under a profiler in hot functions and are clearly visible in both PIX and Razor. It only makes sense to optimize where the profiler points at a problem. Over the past twenty years processors have become far more tolerant of unoptimized code and compilers have learned to clean it up, so fighting branches with crutches and hacks from the late 90s is no longer all that relevant and is mostly needed to squeeze out the last bits of performance during post-release profiling.
Good lecture on this topic (https://course.ece.cmu.edu/~ece740/f15/lib/exe/fetch.php?media=18-740-fall15-lecture05-branch-prediction-afterlecture.pdf)
And one more observation about branches: modern processors (Xbox, PS4, PS5, almost all mobile chips) ignore explicit branch-prediction hint instructions, but such hints still help the compiler lay the code out better. The processor itself relies solely on its BPU, not on “hints” from the programmer.
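That is exactly what source-level hints such as C++20 [[likely]]/[[unlikely]] (or __builtin_expect in GCC/Clang) do: they affect how the compiler arranges the basic blocks, not how the hardware predicts. A tiny illustrative example:

#include <cstdio>

int process(int value)
{
    if (value < 0) [[unlikely]] {       // the compiler moves this path out of the hot layout
        std::printf("bad value\n");
        return -1;
    }
    return value * 2;                   // the common case stays on the fall-through path
}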
Thank you for reading! If I forgot to mention anything, write in the comments.