Let’s talk about optimizing compilers. Tale six: loop invariants
This is a series of articles about optimizing compilers in general and LLVM in particular. See all articles in this series:
SSA form
Dominance
Undefined behavior
Loops
CSE
Loop invariants
Today we’ll talk about several classes of optimizations that allow the compiler to improve the performance of our programs. If you are new here, it’s worth first reading the articles on SSA form, dominance, loops, and (optionally) undefined behavior to understand what we’re talking about.

What is a loop invariant
The previous article looked at a situation where we evaluated the same subexpression several times and could save work by reusing the first result. There, two different operations in the original program computed the same value. But there are also situations where the program contains only one operation, yet its results are redundant because they are computed over and over again. As you have guessed, we are talking about operations inside loops. Consider the following example:
int foo(int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += i + n * 2;
  }
  return sum;
}
As you can easily see, on each iteration we compute n * 2 only to add it to the sum. Because the value of n does not change throughout the function, the expression n * 2 does not change either and does not depend on the loop iteration number. Obviously, it could be computed just once, via the simplest of optimizations:
int foo(int n) {
  int sum = 0;
  int tmp = n * 2;
  for (int i = 0; i < n; i++) {
    sum += i + tmp;
  }
  return sum;
}
The transformed program saves n - 1 multiplications, which is a win whenever n > 1. So let’s start with the definitions.
A loop invariant is an expression whose value does not depend on the loop iteration number.
Obviously, the values of all expressions evaluated before entering the loop are invariants, and some expressions inside the loop can be invariants too, namely those whose operands are all invariant. In our example, the value of n is computed before entering the loop and is therefore invariant, while n * 2 depends only on invariants and is thus itself an invariant, even though it is initially computed inside the loop.
Now let’s look at another example:
int foo(int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    if ((n / 8) > 100) {
      sum += i + n * 2;
    } else {
      sum -= i;
    }
  }
  return sum;
}
As we can see, the expression (n / 8) > 100 is an invariant, so the branch on it is called an invariant branch. Its peculiarity is that once we enter the loop and take one of the branches, we will take the same branch on every iteration.
In the following sections I will show how modern compilers optimize such code.
Removing invariant expressions from a loop
Let’s look at our first example and immediately rewrite it in SSA form:
define i32 @foo(i32 %n) {
entry:
br label %header
header: ; preds = %backedge, %entry
%sum = phi i32 [ 0, %entry ], [ %sum.next, %backedge ]
%i = phi i32 [ 0, %entry ], [ %i.next, %backedge ]
%loop.cond = icmp slt i32 %i, %n
br i1 %loop.cond, label %backedge, label %done
backedge: ; preds = %header
%n2 = mul nsw i32 %n, 2
%i.plus.n2 = add nsw i32 %i, %n2
%sum.next = add nsw i32 %sum, %i.plus.n2
%i.next = add nsw i32 %i, 1
br label %header
done: ; preds = %header
ret i32 %sum
}
An expression in SSA form is a loop invariant if one of two conditions holds:
It strictly dominates the loop header (that is, it is computed before entering the loop);
All operands of the expression are themselves loop invariants.
In the example above, %n and 2 are available before the loop (an argument and a constant, so they trivially dominate the loop header), while %n2 = mul nsw i32 %n, 2 falls under the second case: all of its operands are invariants.
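To make the check concrete, here is a minimal sketch in LLVM’s C++ API. This is a simplified model, not the actual LICM code; isInvariantCandidate is a hypothetical helper, and it only gives the right answer in a single pass if invariant operands are hoisted out of the loop before their users are visited:

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

static bool isInvariantCandidate(const Loop *L, const Instruction *I) {
  // First rule: defined outside the loop, hence invariant by definition.
  if (!L->contains(I->getParent()))
    return true;
  // Second rule: every operand is itself invariant. Loop::isLoopInvariant
  // (used underneath) treats constants, arguments, and instructions
  // outside the loop as invariant.
  return L->hasLoopInvariantOperands(I);
}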
Clearly, if the loop performs at least two iterations, computing invariant expressions outside of it is most likely profitable. The natural question arises: if not in the loop, then where?
Depending on the circumstances, it may pay off to evaluate invariant expressions either before entering the loop (for example, in the preheader block) or after exiting it (for example, in an exit block). Depending on which we choose, the removal algorithm changes slightly. We usually distinguish hoisting (moving up) and sinking (moving down) of invariant expressions along the dominator tree. In LLVM, the LICM (Loop Invariant Code Motion) pass explicitly performs both transformations, but other optimizations carry out similar rewrites in various forms as well. Let’s look at how this is done.
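By the way, it’s easy to watch these transformations live: assuming you have LLVM installed, you can run LICM in isolation with opt -passes=licm -S input.ll on the IR examples from this article and compare the code before and after.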
In both cases, the algorithm can be described as follows:
We walk the blocks of the loop in a certain order;
We walk the instructions inside each block in a certain order;
Whenever we see a loop-invariant instruction, we move it to the target block and continue (see the sketch after this list).
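Here is what the one-pass hoisting variant of this procedure might look like in LLVM’s C++ API. This is a hedged sketch that deliberately omits the safety analysis (faulting instructions, memory dependencies, side effects) performed by the real implementation in llvm/lib/Transforms/Scalar/LICM.cpp; hoistInvariants is a hypothetical helper:

#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/LoopIterator.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static void hoistInvariants(Loop *L, LoopInfo &LI) {
  BasicBlock *Preheader = L->getLoopPreheader();
  if (!Preheader)
    return; // we need a preheader to serve as the target block

  // Reverse post-order over the loop body: a topological order that
  // ignores the back edge, so operands are visited before their users.
  LoopBlocksRPO RPOT(L);
  RPOT.perform(&LI);

  for (BasicBlock *BB : RPOT)
    for (Instruction &I : make_early_inc_range(*BB)) {
      if (I.isTerminator() || isa<PHINode>(&I))
        continue; // control flow and phis stay where they are
      // A real pass must also prove the instruction is safe to
      // speculate (e.g. no division that might trap); here we only
      // conservatively skip anything that touches memory.
      if (I.mayHaveSideEffects() || I.mayReadFromMemory())
        continue;
      if (L->hasLoopInvariantOperands(&I))
        I.moveBefore(Preheader->getTerminator());
    }
}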
The traversal order for blocks and instructions is chosen for efficiency. Ideally the algorithm converges in a single pass, after which no invariant expressions remain in the loop. When hoisting (i.e., moving instructions from the loop body to the preheader), we want all invariant operands of an instruction to have been moved out of the loop by the time we process it, so that checking the instruction is cheap and moving it is legal (recall that all operands of an instruction must dominate it). Conversely, when sinking, we want all invariant users of an instruction to have been moved down by the time we process it, because we must maintain the SSA property that an operand dominates its users. Let’s see this with an example:
define i32 @example_hoist_sink(i32 %a, i32 %b, i32 %c) {
entry:
br label %header
header: ; preds = %backedge, %entry
%iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ]
%sum = phi i32 [ 0, %entry ], [ %sum.next, %backedge ]
%cond = icmp slt i32 %iv, 12
%x = mul i32 %a, %b
br i1 %cond, label %if.true, label %backedge
if.true: ; preds = %header
%y = mul i32 %x, %c
%z = mul i32 %y, %c
%sum.incremented = add i32 %sum, %x
%if.true.cond = icmp eq i32 %x, 1000
br i1 %if.true.cond, label %backedge, label %side.exit
backedge: ; preds = %if.true, %header
%sum.next = phi i32 [ %sum, %header ], [ %sum.incremented, %if.true ]
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 %sum
side.exit: ; preds = %if.true
ret i32 %z
}
Here the values %a, %b, and %c dominate the loop header (and are therefore invariants), while %x, %y, %z, and %if.true.cond are computed inside the loop but depend only on invariant arguments. Let’s first look at what hoisting does in this case. The hoisting algorithm walks the blocks in RPOT (reverse post-order traversal, i.e., a topological order ignoring the back edge) and the instructions within each block from top to bottom. This traversal guarantees the required property that invariant operands are processed before their users. Here the algorithm proceeds as follows:
The RPOT of this loop’s blocks is header, if.true, backedge. They are processed in this order, with the instructions in each block processed from top to bottom.
While processing header, it turns out that all operands of %x are invariants, so it is moved to %entry.
While processing if.true, the instruction %y is handled first and hoisted, since by that point both of its operands, %x and %c, are already invariants; then %z is hoisted, and then the condition %if.true.cond as well.
As a result, we will get the following code:
define i32 @example_hoist_sink(i32 %a, i32 %b, i32 %c) {
entry:
%x = mul i32 %a, %b
%y = mul i32 %x, %c
%z = mul i32 %y, %c
%if.true.cond = icmp eq i32 %x, 1000
br label %header
header: ; preds = %backedge, %entry
%iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ]
%sum = phi i32 [ 0, %entry ], [ %sum.next, %backedge ]
%cond = icmp slt i32 %iv, 12
br i1 %cond, label %if.true, label %backedge
if.true: ; preds = %header
%sum.incremented = add i32 %sum, %x
br i1 %if.true.cond, label %backedge, label %side.exit
backedge: ; preds = %if.true, %header
%sum.next = phi i32 [ %sum, %header ], [ %sum.incremented, %if.true ]
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 %sum
side.exit: ; preds = %if.true
ret i32 %z
}
Are we happy with this optimization? On the one hand, we have significantly reduced the amount of computation performed while the loop runs. On the other hand, it could turn out that we never enter the block if.true, in which case none of the derived values were ever actually needed. In that case it would have been more profitable to simply move %x into %if.true, removing it from the header, and we would have gained more. So what should we do?
In fact, hoisting %x into the preheader is still cheaper than recomputing it on every iteration in the header, even if it is not the optimal placement. As for %y and %z, they are needed only in the exit block %side.exit, so it makes more sense to sink them rather than hoist them. I will leave the question of which traversal order of blocks and instructions to use for sinking as an exercise for the reader. In the optimal solution, %y and %z are sunk into the exit block. As for %x and %if.true.cond, it is profitable either to hoist them into the preheader or to sink them into %if.true, depending on which block is hotter. You can tell if you have profiling information or some extra knowledge about the loop condition. In this case it can be statically proven that if.true executes exactly 12 times, so the hoisted computations are never wasted, and hoisting into the preheader, where they run just once, is the better choice. So the optimal solution for this problem looks like this:
define i32 @example_hoist_sink(i32 %a, i32 %b, i32 %c) {
entry:
%x = mul i32 %a, %b
%if.true.cond = icmp eq i32 %x, 1000
br label %header
header: ; preds = %backedge, %entry
%iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ]
%sum = phi i32 [ 0, %entry ], [ %sum.next, %backedge ]
%cond = icmp slt i32 %iv, 12
br i1 %cond, label %if.true, label %backedge
if.true: ; preds = %header
%sum.incremented = add i32 %sum, %x
br i1 %if.true.cond, label %backedge, label %side.exit
backedge: ; preds = %if.true, %header
%sum.next = phi i32 [ %sum, %header ], [ %sum.incremented, %if.true ]
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 %sum
side.exit: ; preds = %if.true
%y = mul i32 %x, %c
%z = mul i32 %y, %c
ret i32 %z
}
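How might a compiler make this hot-or-not decision programmatically? With profile data, block frequencies give a simple criterion. The following is a hedged sketch, not an actual LLVM heuristic; hoistingIsProfitable is a hypothetical helper built on the real BlockFrequencyInfo analysis:

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/BasicBlock.h"

using namespace llvm;

// Compare expected execution counts: the preheader runs once per loop
// entry, while an in-loop block runs once per iteration that reaches it.
// Prefer whichever placement executes the computation less often.
static bool hoistingIsProfitable(const BasicBlock *InLoopBB,
                                 const BasicBlock *Preheader,
                                 const BlockFrequencyInfo &BFI) {
  return BFI.getBlockFreq(Preheader) <= BFI.getBlockFreq(InLoopBB);
}

In the example above, the preheader runs once per call while %if.true runs 12 times, so this sketch would choose hoisting.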
Invariant Loads from Memory
A separate caveat is needed for the case when we are dealing with memory. Memory is not fully captured by SSA form; in LLVM IR it is modeled through the memory-reading and memory-writing properties of instructions. There is an ongoing effort to improve the situation with MemorySSA, but it remains an analysis layered on top of the IR rather than part of it.
Let’s look at a simple example:
define void @example_hoist_load(ptr dereferenceable(32) %p1, ptr dereferenceable(32) %p2) {
entry:
br label %header
header: ; preds = %backedge, %entry
%iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ]
%cond = call i1 @cond()
br i1 %cond, label %if.true, label %if.false
if.true: ; preds = %header
store i32 %iv, ptr %p1, align 4
br label %backedge
if.false: ; preds = %header
%loaded.value = load i32, ptr %p2, align 4
br label %backedge
backedge: ; preds = %if.false, %if.true
%phi = phi i32 [ %iv, %if.true ], [ %loaded.value, %if.false ]
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret void
}
declare i1 @cond() #0
attributes #0 = { memory(read) }
Here, depending on the %cond condition, we either write memory through the pointer %p1 or read it through the pointer %p2. Can we hoist the load from %p2 into the preheader?
In fact, the answer is complicated and depends on many factors. The hoist is possible if every read from this memory yields the same value. At a minimum, this holds if we can prove one of the following facts (a small C illustration of the first point follows the list):
No write in the loop touches this memory (i.e., the memory regions behind %p1 and %p2 do not overlap);
The function cond() is monotonic: it may return false at first and true afterwards, but never the other way around. If so, all reads from %p2 happen before any write to %p1, and the load can be hoisted.
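To see why the no-overlap condition matters, here is a small self-contained C illustration (hypothetical code, not derived from the IR above): if the two pointers alias, the loaded value changes between iterations, and hoisting the load would change the program’s result.

#include <stdio.h>

/* If p1 and p2 alias, the value behind p2 changes on every iteration,
 * so a load hoisted out of the loop would observe a stale value. */
int sum_loop(int *p1, int *p2, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    *p1 = i;     /* the write through p1 ...                     */
    sum += *p2;  /* ... feeds the read through p2 when p1 == p2  */
  }
  return sum;
}

int main(void) {
  int x = 0;
  printf("%d\n", sum_loop(&x, &x, 4)); /* prints 6 (0+1+2+3) */
  /* with the load of *p2 hoisted before the loop, the result would be 0 */
  return 0;
}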
We must also not forget that to hoist a load out of a conditional block, we have to prove that the pointer is safe to read at the new location (in our example this is guaranteed by the dereferenceable(32) attribute on the arguments). Reading through an invalid pointer can lead to undefined behavior, and I have a separate article about its consequences.
All of this is a fairly advanced area of optimization and goes far beyond the scope of today’s story. For now, let’s just remember that memory is not that simple: to move a read or a write, invariance of the pointer is not enough. You also need to know something about the invariance of the data behind that pointer.
Branching under invariant conditions
So, moving individual instructions out of loops is more or less clear. But is it possible to move out more complex code involving control flow, if all of its branches are invariant? There are in fact several ways to do this, and I will try to list them, though I cannot promise to cover every nuance. In LLVM this is done explicitly by the same LICM and by Loop Unswitching, and I suspect that other transformations sometimes perform elements of it as well.
Let’s look at the simplest example:
define i32 @example_hoist_cfg_simple(i32 %n, i32 %m) {
entry:
%cond = icmp eq i32 %n, %m
br label %header
header: ; preds = %backedge, %entry
%iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ]
br i1 %cond, label %backedge, label %side.exit
backedge: ; preds = %header
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 -1
side.exit: ; preds = %header
ret i32 %iv
}
Here the loop contains a branch on an invariant condition, and no important code executes before that branch. In effect, either we exit to %side.exit immediately on the first iteration (if %cond == false), or we never reach that block at all. In this situation the branch can be hoisted into the preheader, just as we previously hoisted expressions. Let’s perform this operation mechanically:
define i32 @example_hoist_cfg_simple(i32 %n, i32 %m) {
entry:
%cond = icmp eq i32 %n, %m
br i1 %cond, label %preheader, label %side.exit
preheader: ; preds = %entry
br label %header
header: ; preds = %backedge, %preheader
%iv = phi i32 [ 0, %preheader ], [ %iv.next, %backedge ]
br label %backedge
backedge: ; preds = %header
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 -1
side.exit: ; preds = %entry
ret i32 %iv ; THIS IS ILLEGAL!
}
So, we turned the conditional jump in the loop into an unconditional one and hoisted the branch up. But there is a problem with this code: previously we could use the value %iv in the block %side.exit, because a dominance relationship held between them; now we cannot. On the path entry -> side.exit we never compute %iv at all! Therefore, before performing such a transformation, we must ensure that all values used in %side.exit will still be available there after the optimization. For example, here we can use the knowledge that %side.exit could only ever be reached on iteration zero, and replace ret i32 %iv with ret i32 0. Also, to avoid branching twice, we can eliminate the unconditional jump by merging the blocks header and backedge into one. The final optimized code looks like this:
define i32 @example_hoist_cfg_simple(i32 %n, i32 %m) {
entry:
%cond = icmp eq i32 %n, %m
br i1 %cond, label %preheader, label %side.exit
preheader: ; preds = %entry
br label %header
header: ; preds = %header, %preheader
%iv = phi i32 [ 0, %preheader ], [ %iv.next, %header ]
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %header
ret i32 -1
side.exit: ; preds = %entry
ret i32 0
}
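At the source level, the whole sequence corresponds roughly to the following (a hypothetical C rendering of the IR, for illustration only):

int f(int n, int m) {
  for (int iv = 0; ; ) {
    if (n != m)        /* invariant branch */
      return iv;
    if (++iv >= 1000)
      return -1;
  }
}

becomes

int f(int n, int m) {
  if (n != m)
    return 0;          /* side.exit is only reachable with iv == 0 */
  for (int iv = 0; iv < 1000; iv++)
    ;                  /* the loop body is now empty */
  return -1;
}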
Let’s complicate the example a little. Now there is an instruction with a side effect before the branch we want to hoist:
define i32 @example_hoist_cfg_side_effects(i32 %n, i32 %m) {
entry:
%cond = icmp eq i32 %n, %m
br label %header
header: ; preds = %backedge, %entry
%iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ]
call void @side_effect_1()
call void @side_effect_2()
call void @side_effect_3()
br i1 %cond, label %backedge, label %side.exit
backedge: ; preds = %header
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 -1
side.exit: ; preds = %header
ret i32 %iv
}
declare void @side_effect_1()
declare void @side_effect_2()
declare void @side_effect_3()
Now we must take into account that the code above the branch may have written to some external memory, and before the function returns through %side.exit, the outside world has the right to expect those writes to have happened. Therefore, after optimization, the branch must not end up above those side effects. The optimized code is thus forced to duplicate the instructions that precede the branch:
define i32 @example_hoist_cfg_side_effects(i32 %n, i32 %m) {
entry:
%cond = icmp eq i32 %n, %m
br i1 %cond, label %preheader, label %header.copy
preheader: ; preds = %entry
br label %header
header: ; preds = %backedge, %preheader
%iv = phi i32 [ 0, %preheader ], [ %iv.next, %backedge ]
call void @side_effect_1()
call void @side_effect_2()
call void @side_effect_3()
br label %backedge
backedge: ; preds = %header
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %backedge
ret i32 -1
side.exit: ; preds = %header.copy
ret i32 %iv.copy
header.copy: ; preds = %entry
%iv.copy = phi i32 [ 0, %entry ]
call void @side_effect_1()
call void @side_effect_2()
call void @side_effect_3()
br label %side.exit
}
declare void @side_effect_1()
declare void @side_effect_2()
declare void @side_effect_3()
Or, after simplifications and merging of unconditional jumps:
define i32 @example_hoist_cfg_side_effects(i32 %n, i32 %m) {
entry:
%cond = icmp eq i32 %n, %m
br i1 %cond, label %header, label %header.copy
header: ; preds = %header, %entry
%iv = phi i32 [ %iv.next, %header ], [ 0, %entry ]
call void @side_effect_1()
call void @side_effect_2()
call void @side_effect_3()
%iv.next = add i32 %iv, 1
%loop.cond = icmp slt i32 %iv.next, 1000
br i1 %loop.cond, label %header, label %done
done: ; preds = %header
ret i32 -1
header.copy: ; preds = %entry
call void @side_effect_1()
call void @side_effect_2()
call void @side_effect_3()
ret i32 0
}
declare void @side_effect_1()
declare void @side_effect_2()
declare void @side_effect_3()
But sometimes a branch on an invariant condition does not exit the loop, or the amount of code that would have to be copied to hoist it is very large. In that case the full version of this transformation is performed, which is easiest to describe schematically. The initial code:
cond = ...
for (...) {
  // A
  if (cond) {
    // B
  } else {
    // C
  }
  // D
}
turns into
cond = ...
if (cond) {
  for (...) {
    // A
    if (true) {
      // B
    } else {
      // C
    }
    // D
  }
} else {
  for (...) {
    // A
    if (false) {
      // B
    } else {
      // C
    }
    // D
  }
}
Here the volume and complexity of the code fragments A, B, C, and D do not really matter. The transformation is performed purely mechanically; subsequent passes will eliminate the branches on constant conditions and rewrite the phi nodes accordingly. The only drawback is the doubling of code size, which hurts both compilation time and the amount of generated code. Therefore, you should resort to this technique (it’s called non-trivial unswitching) only as a last resort and for the sake of large performance gains.
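As a concrete illustration, applying this scheme to the invariant-branch example from the beginning of the article (and then folding the constant conditions) would give something like this sketch:

int foo(int n) {
  int sum = 0;
  if ((n / 8) > 100) {
    for (int i = 0; i < n; i++)
      sum += i + n * 2;
  } else {
    for (int i = 0; i < n; i++)
      sum -= i;
  }
  return sum;
}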
In fact, as you may have noticed, all the simple cases of moving invariant control flow out of loops are just special cases of this last, most heavyweight transformation. In practice, however, they are important in their own right, because they are much cheaper in compile time and do not cause excessive code bloat.
Conclusion
In this article we looked at how to reduce computational redundancy by moving invariant expressions and branches out of loops. Some implementation subtleties were left out of scope, in particular those related to undefined behavior (covered earlier in the corresponding article), as well as the complexities of how these optimizations interact with subsequent code generation. The pace of publication has slowed a little, but I still hope to eventually cover all the interesting high-level optimizations I dealt with in my time as a compiler engineer.