Compiler optimizations are hard because compilers are forgetful

How exactly would you design an optimizing compiler? More precisely, how exactly would you design and implement specific optimizations? Trying to solve this problem in one sitting is mind-bogglingly difficult, perhaps even impossible, since compiler optimizations largely consist of the following:

  1. Identify a situation in which a specific trick could be applied;

  2. Build an analysis that would allow us to find such a situation (or an extended version of it);

  3. Apply this trick to all suitable points that you can find in the program.

String together a hundred of these “optimization steps” (hopefully reusing step 2 as much as possible) and suddenly your compiler starts making really big, complex changes that are built up out of small ones!

When you start learning compilers, the first optimizations you meet are “peephole optimizations”: very simple passes that require almost no analysis. Some of them could probably even be implemented as glorified regular-expression matches. Here are a couple of simple peephole optimizations:

  • replace_muls_with_shifts: x * 2 ⇒ x << 1 (the latter is cheaper)

  • merge_shifts: (x << 1) << 1 ⇒ x << 2 (the latter is cheaper)
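
To make the shape of these rules concrete, here's a minimal sketch over a toy expression IR. The Expr type and the exact rule shapes are my own invention for illustration, and a real peephole pass would also walk the whole tree rather than only looking at the root node:

#[derive(Clone)]
enum Expr {
    Var(&'static str),
    Mul(Box<Expr>, u64), // e * constant
    Shl(Box<Expr>, u64), // e << constant
}

// replace_muls_with_shifts: x * 2  =>  x << 1
fn replace_muls_with_shifts(e: &mut Expr) {
    if let Expr::Mul(lhs, 2) = e {
        let shifted = Expr::Shl(lhs.clone(), 1);
        *e = shifted;
    }
}

// merge_shifts: (x << 1) << 1  =>  x << 2
fn merge_shifts(e: &mut Expr) {
    if let Expr::Shl(inner, 1) = e {
        if let Expr::Shl(x, 1) = &**inner {
            let merged = Expr::Shl(x.clone(), 2);
            *e = merged;
        }
    }
}

Each rule just looks for one specific shape and rewrites it in place.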

But hold on. What if you have (x * 2) * 2? That can be turned into x << 2! And we don't even need to write a new peephole for it, since we can get the result just by combining the peepholes we already have, right? Let's do exactly that:

(x * 2) * 2
(x << 1) << 1
x << 2

Now let's try it in a real compiler:

Compiler: I finished optimizing! The final program is (x << 1) << 1

UGH. Want to guess what code I wrote?

fn do_optimizations() {
    merge_shifts();              // runs first, while there are no shifts to merge yet
    replace_muls_with_shifts();  // runs second, producing shifts that never get merged
}

It turns out I had run my peephole optimizations in the “wrong” order. Okay, we can just rearrange them. But wait, once we have a lot of these passes, there might not be one optimal order at all, and we'll constantly be missing “obvious” optimizations. Should we run the passes in a loop? If so, how do we know when to stop?? I guess we could check whether any rule could still be applied after making the changes, and repeat the run if so??? But, oh my god, what guarantees that this loop ever stops???? And really, at what point do the costs of this approach outweigh the benefits, well, aaAAaaAAAaAAAAA!!!!!
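
For what it's worth, the “check whether anything changed and go again” idea usually ends up looking something like this sketch. I'm assuming the toy Expr from earlier and passes that report whether they actually rewrote anything (which the toy versions above don't do); none of this is a real compiler's API:

// Keep running the passes until none of them changes anything (a fixpoint),
// with a cap on the number of rounds so we are guaranteed to stop even if
// two rules keep creating work for each other.
fn run_to_fixpoint(program: &mut Expr, passes: &[fn(&mut Expr) -> bool]) {
    const MAX_ROUNDS: usize = 10; // arbitrary safety valve
    for _ in 0..MAX_ROUNDS {
        let mut changed = false;
        for &pass in passes {
            changed |= pass(program); // order still matters within a round!
        }
        if !changed {
            break; // fixpoint reached: no rule applied anywhere
        }
    }
}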

It's impossible to give a definitive answer here, so you just try things until the results look reasonable. Then you try not to think about pass ordering for a while, until you once again stumble into a hornet's nest and discover that two passes have been running in a subtly wrong order for a very long time. Even more likely, you swap two passes and the compiler simply breaks, because nobody ever tried that order before and there was a non-obvious dependency between the two.

I'll quote an anonymous colleague who has suffered in the guts of compilers far longer than I have:

Designing optimizations is a science, but ordering the passes is an art.

When I mused about this on Twitter, my mentions turned into a real disaster, because I had accidentally nerd-sniped half of the world's leading compiler experts, and they started pelting me with their favourite papers on optimization design. So presumably this is all perfectly solvable, an ideal compiler can be built, and all existing compilers are just “cowards” and “fools”.

❯ Hazards

The merge_shifts peephole I described at the very beginning is too simplistic. It can't even handle (x << 2) << 3! What I'd really like is to replace every instance of (x << y) << z with x << (y + z)!

Compiler: Hi! Yes, I converted x << 30 << 40 into x << 70

Much better – but wait, what type is x? A 64-bit integer? Hmm… then we have a problem. If you lower this as literally as possible to x64, the code will behave like x << 6, because the shl instruction simply takes the shift amount mod 64. Under any reasonable semantics, it's clear that the original program did something completely different!
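
To pin down what the hardware does here (the function name is mine, just to make the behaviour concrete):

// On x86-64, shl on a 64-bit operand masks the shift amount to its low
// 6 bits, i.e. takes it mod 64. So the "merged" x << 70 behaves like x << 6,
// while the original (x << 30) << 40 always produces 0.
fn x86_64_shl(x: u64, amount: u32) -> u64 {
    x << (amount % 64)
}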

So we need to be a little more careful with our optimizations. As a rule, there will be hazards that invalidate an optimization. In this case, the hazard is “the sum of the two shift amounts is too large”.

Sometimes such hazards are easy to spot, but the more you try to generalize your optimizations, the more you end up in murky territory. For example, you might want to allow one of the shift amounts to be a variable! One can certainly imagine replacing (x << 10) << z with x << (10 + z), but… how do we know that this is correct? An oversized shift is undefined behaviour, so the compiler gets to assume z < 64, but that tells you nothing about 10 + z < 64!

So it's easy to generalize the peephole to the point of unsoundness, since we can never be sure such code is correct (let alone whether it's profitable). Unless… we make our analysis smarter! If the code we're interested in is wrapped in if z < 20 {, then the rewrite is valid. Perhaps we can come up with data structures and analyses that track dataflow and other facts about the code, and help us prove that the hazards don't apply!
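
For the constants-only version of the rule, checking for the hazard can be as simple as the sketch below (reusing the toy Expr from earlier and assuming x is 64 bits wide). The variable-amount case is exactly where you need the smarter analysis that can see the surrounding if z < 20 {:

// Only fold (x << a) << b into x << (a + b) when we can prove the combined
// amount is still a valid shift for a 64-bit value; otherwise leave the
// code alone.
fn merge_shifts_checked(e: &mut Expr) {
    if let Expr::Shl(inner, b) = e {
        if let Expr::Shl(x, a) = &**inner {
            match a.checked_add(*b) {
                Some(total) if total < 64 => {
                    let merged = Expr::Shl(x.clone(), total);
                    *e = merged;
                }
                // Can't rule out the hazard (the sum overflows or is >= 64),
                // so don't touch it.
                _ => {}
            }
        }
    }
}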

Suddenly it turns out we're building a Serious Optimizing Compiler, and not just banging out a few peepholes!

❯ Oops, was that important?

So, at this point, your compiler is already pretty sophisticated. You're building lots of fancy data structures and analyses just to prove that certain optimizations are valid (let's call all this extra information “metadata”). If you manage to do that, you rewrite the program/IR into a new form that is hopefully better than the old one.

This is quite tricky, because each optimization not only has to perform its own Weird Trick, but also has to remember to fix up all the metadata that was attached to the old version of the code. In particular, this applies to debuginfo, which tries to map the current program back to the original source. It's needed so that when the user asks for “x”, you can work out which register “x” actually lives in right now. Needless to say, plenty of bugs of this kind just happen, but that's not the topic of this article.
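
To give a feel for the kind of bookkeeping this involves, here's a hedged sketch of a debuginfo-style side table; all the names are invented, and real debuginfo formats are far more involved:

use std::collections::HashMap;

// Where does each source-level variable currently live? Every rewrite that
// moves a value around is supposed to update this too, or the debugger will
// look for "x" in the wrong place.
#[derive(Clone, Copy)]
enum Location {
    Register(u8),
    StackSlot(i32),
}

struct DebugInfo {
    locations: HashMap<&'static str, Location>,
}

impl DebugInfo {
    // Forgetting to call this after a rewrite is exactly the kind of
    // "the metadata went stale" bug described above.
    fn record_move(&mut self, var: &'static str, new_home: Location) {
        self.locations.insert(var, new_home);
    }
}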

What really interests me (and is the reason I started writing this article) is that in practice these optimizations are usually destructive. The transformation x * 2 * 2 => x << 2 works precisely because we completely forget that there was ever a multiplication here, and move on. Similarly, many optimizations rely on things like inlining, constant folding and dead code elimination to strip away irrelevant details and make the Specific Situation they handle as clear as possible.

And generally that's a good thing! If my program doesn't need to do something, I don't want it doing it!

But here's the problem: with so many optimization passes, each guarded by genuinely complicated safety checks, it turns out that things which look completely irrelevant at first glance are actually load-bearing: they are exactly what keeps certain optimizations from going wrong.

Oops! So that was important?

The example that will haunt me for eternity, and that horrifies everyone who hears about it for the first time, is that llvm arguably shouldn't be allowed to remove supposedly useless pointer-to-integer casts, because precisely such optimizations can lead to genuine miscompilations of correct code. YES.

Basically, the idea is this: compilers have detailed information about whether two pointers can alias. If they definitely can't alias, then writes through one can't be observed through the other. But if you cast the two pointers to integers to compare their addresses, the compiler can get overzealous and start replacing expressions involving one pointer with expressions involving the other wherever it can, until a “write to x, read from x” somewhere in the code becomes “write to y+1, read from x”.

Now we're dangerously close to the compiler deciding that the write and the read are unrelated, and that the write can therefore be removed. The one hope is that the compiler knows a pointer-to-integer cast is a huge alias-analysis hazard and keeps that in mind… unless another optimization notices that those pointer-to-integer casts are unused and removes them.
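
To make this less abstract, here's a sketch of the kind of code in question. Everything is invented for illustration, and I'm not claiming that any particular compiler version miscompiles exactly this function; it just shows how the two rewrites above can gang up on each other:

// `x` points into one allocation and `y.wrapping_add(1)` is one-past-the-end
// of another, so the two addresses can be equal even though the pointers
// must not be used interchangeably.
unsafe fn sketch(x: *mut i32, y: *mut i32) -> i32 {
    let x_addr = x as usize;                         // ptr-to-int cast
    let y_one_past_end = y.wrapping_add(1) as usize; // ptr-to-int cast
    if x_addr == y_one_past_end {
        // Step 1: an overzealous pass sees the addresses are equal and
        // rewrites this "write through x" as a "write through y + 1".
        *x = 42;
    }
    // Step 2: alias analysis decides that a write through `y + 1` can never
    // be observed through `x`, so the write above looks dead and gets
    // removed... unless something remembers that the casts were a hazard.
    *x
}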

Imagine Gretel carefully leaving a trail of breadcrumbs to find her way out of the forest, and Hansel seeing Free Crumbs on the path – and eating them. We are now lost in a forest of optimizations, and here a dragon may well eat us.

In a sense, the problem is that computer programs suffer terribly from in-band signaling: even the most innocuous-looking operation, like a cast, carries a second meaning that alias analysis depends on. But the alternative is… also pretty messy. With a thousand optimizations and dozens of auxiliary data structures storing ever more important facts about the program out of band, it becomes increasingly difficult for the next optimization pass to remember to check all of that information, or to keep it up to date.

I don't have any good answers here. Just “oh no” and “compilers are hard”. You can call this whole thing a whining fit. Maybe RVSDG or some other magic wand will fix this, and mere mortals will be able to reason about compiler semantics.

❯ Hardware caches hit different

One last thing. I recently learned that hardware memory models are supposedly much more robust and well-defined than the memory models used by compilers. But why? And if so, why don't compilers just adopt these kinds of models?

As far as I can tell, this is just another consequence of everything above. We expect the compiler to be destructive when it optimizes, while the hardware is expected to more or less “do what it's told”. People like to point out that hardware pulls plenty of “compiler-like” tricks under the hood, but in the end the hardware interprets most instructions fairly literally, and the full “source code” is always right there, running on the hardware.

Sure, the CPU may have caching tricks in place that mean a load or store never “really” happens. But the CPU still knows that a memory access logically happened there, and that all the in-band signals attached to that access rely on the processor upholding certain guarantees.

The compiler, on the other hand, is expected to look at a load and realize that it can cache the loaded value and reuse it, or re-load it when needed, or throw the access away entirely – and nothing is left behind as a reminder. Even if the compiler used an auxiliary data structure to remember that the load existed, that structure disappears once the final binary is built. It's just gone.

And that, my dears, is why

drumroll

memory_order_consume exists, and why it is a disaster! memory_order_consume tries to capture the more subtle ordering guarantees that operate at the hardware level.

You see, hardware is generally willing to give you strong guarantees about accesses to the same memory location (even relaxed atomic operations get a “Total Modification Order”). Things become much shakier when you need two completely unrelated memory locations to be consistent with each other. Typically the solution involves memory “barriers”, which tell all the CPU cores to pause and get on the same page – a very heavy, expensive operation.

But there's one thing essentially all hardware memory models agree on: if you load pointer B from pointer A, then accesses through B are ordered after the load from A! That's confusing in the abstract, so let's look at a simple example:

static LATEST_DATA: AtomicPtr<u64> = AtomicPtr::new(ptr::null_mut());
static mut MY_DATA: u64 = 0;

unsafe fn thread1() {
    MY_DATA = 5;                                // initialize our data
    LATEST_DATA.store(&mut MY_DATA, Release);   // share it with the world
}

// NB: Rust's std has no Consume ordering; it stands in here for C++'s
// memory_order_consume, purely for illustration.
unsafe fn thread2() {
    let latest_ptr = LATEST_DATA.load(Consume); // grab the latest pointer
    let data = *latest_ptr;                     // read the data it points to
    println!("{data}");
}

One thread writes some data into MY_DATA and then tells all the other threads to go read it by storing a pointer to that data in LATEST_DATA (with Release semantics, so that the write to MY_DATA definitely happens before the pointer is published). Another thread then loads that pointer from LATEST_DATA (with Consume semantics, no barrier required) and reads MY_DATA through it. Since reading the data that LATEST_DATA points to obviously requires reading the LATEST_DATA pointer first, there is an obvious causal relationship, and no additional synchronization is needed!

“Do something nasty, then throw a pointer to that nasty thing into a data structure” is the bread and butter of anyone used to working with non-blocking data structures, so we'd definitely benefit greatly if we could make this operation more efficient (and intuitively correct!).

Unfortunately, this wonderfully elegant system falls apart under an optimizing compiler, because we've told it to comb through the code and eliminate every load and store it can. Your carefully built chain of causality can be destroyed the moment the compiler decides a load is redundant and removes it. A more subtle failure mode: the compiler notices that two pointers are guaranteed to be equal, and redirects all accesses that went through one of them through the other!
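
For example, here's a hypothetical “after optimization” version of thread2 from the example above, in the same illustrative pseudocode style (Consume and all). Once the compiler believes latest_ptr must be equal to a pointer it already knows, it can read MY_DATA directly, and the address dependency the hardware promised to honour quietly disappears:

// Not something any specific compiler is documented to emit; just the shape
// of the rewrite described above.
unsafe fn thread2_after_optimization() {
    let latest_ptr = LATEST_DATA.load(Consume);
    let data = if latest_ptr == ptr::addr_of_mut!(MY_DATA) {
        MY_DATA        // direct load: no address dependency on latest_ptr!
    } else {
        *latest_ptr
    };
    println!("{data}");
}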

Oops, that was important!

As a result, every compiler just treats consume as acquire, which is slightly stronger and more expensive than it needs to be, because they simply can't support consume properly while also performing all the other optimizations we expect of them.

Naturally, this situation frustrates the people who work close to the hardware, which is why they keep asking for seemingly “nonsensical” things like a “release load”. All they're really asking for is access to guarantees the hardware already provides: “look, they're right there.” But alas, the compiler's memory model simply isn't built to express such guarantees.

Yeah, compilers are damn complicated.
