Getting the most out of Go

Developers who use Go face the challenge of squeezing maximum performance out of every line of code. But what do you do when there seems to be nothing left to optimize, yet you still need more speed?

My name is Nikita Galushko. I am a senior developer in the high-load systems and optimization department at VKontakte. In this article I will share the tricks that help you use Go to its full potential.

What will the article be about?

  • I'll talk about memory: small-size objects, interfaces, and a couple of tricks with the stack.

  • I'll show how much BCE (Bounds Check Elimination) can affect performance and why not all loops (for-loop vs. for-range) are equally useful.

  • I will point out the quirks that the current Go compiler imposes on our code.

  • I will touch on optimal conversion of string -> []byte and []byte -> string, concatenation and related optimizations, and sorting []string. This matters because our everyday programs use strings heavily, and the string type is surrounded by many myths.

First and foremost, we write code for humans. But if you need to squeeze out maximum performance, then you will have to abandon this approach and start writing code for the compiler. I haven't seen high-performance code that is easy to read, test, and modify.

All of this matters only if you have already thought about the data structures and algorithms used in your application. There is no point in trying to win nanoseconds or a few megabytes if the wrong data structure is being used.

What are small-size objects in Go?

These are objects no larger than four machine words. For 64-bit systems it is 32 bytes. To understand how much they can affect performance, let's write a couple of benchmarks.

Imagine that you have a small-size object consisting of four int fields. Next to it is an object that differs by only one extra int. The functions are simple: each sums two fields of its object.

type SmallObject struct {
    a, b, c, d int
}

type PlainObject struct {
    a, b, c, d int
    e          int
}

func sum1(obj SmallObject) int {
    return obj.a + obj.d
}

func sum2(obj PlainObject) int {
    return obj.a + obj.d
}

The benchmark for this code is also as simple as possible.

func BenchmarkSmallObject(b *testing.B) {
    var ret int
    for i := 0; i < b.N; i++ {
        obj := SmallObject{a: i, d: i}
        ret = sum1(obj)
    }
    _ = ret
}

func BenchmarkPlainObject(b *testing.B) {
    var ret int
    for i := 0; i < b.N; i++ {
        obj := PlainObject{a: i, d: i, e: i}
        ret = sum2(obj)
    }
    _ = ret
}

We construct the object on the spot and call the function depending on the type. If we run it, we will see that the function that works with a small-size object runs 2–2.5 times faster.

BenchmarkSmallObject-8     1000000000 1.037 ns/op 0 B/op   0 allocs/op

BenchmarkPlainObject-8     423361981 2.571 ns/op 0 B/op    0 allocs/op
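
For reference, numbers like these come from Go's built-in benchmarking tooling; assuming the benchmarks live in the current package, a typical invocation looks like this:

go test -bench=. -benchmem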

This is very strange: we have two almost identical objects that differ only by one int. Let me remind you that our goal is to save nanoseconds, since the function can be called thousands and millions of times. To figure this out, we use assembler. We will refer to it throughout the article. I'll share how you can do this yourself.

There is a resource called Compiler Explorer (godbolt.org). I prefer it to compiling with the go tool locally, because you can experiment with different Go versions and target architectures. But if you want to experiment on your local machine or don't have access to godbolt, you can use the following command:

go tool compile -S main.go > main.s

Let's compile and look at the assembler.

The assembler shows that the function working with the small-size object boils down to a couple of instructions, and note that it interacts with registers directly. With the object that is just one int larger, we spend time filling the stack and then reading from it as the function executes.

Small-size objects don't just save memory. They let the compiler generate more performant code that works directly with CPU registers. There is one caveat: this is a fairly rare case, and it is difficult to design a whole system around such micro-objects.

Why are interfaces bad when it comes to performance in Go?

I will try to convince you of this with several illustrations. Imagine that you have an interface with only two methods.

type Summer interface {
    SumValue(Object) int
    SumPointer(*Object) int
}

One method takes an argument by value, and the second takes an argument by pointer. Let's assume that the object is as simple as possible – only two int.

type Object struct {
    A, B int
}

Let’s also assume that all the optimizations that work for small-size objects will also work here. The code is also simple – the sum of these two fields.

type Sum struct{}

func (v Sum) SumValue(obj Object) int {
    return obj.A + obj.B
}

func (p Sum) SumPointer(obj *Object) int {
    return obj.A + obj.B
}

Let's run a simple benchmark.

BenchmarkSumPointer-8    101133675    11.59 ns/op    16 B/op    1 allocs/op

BenchmarkSumValue-8      609446936    1.855 ns/op     0 B/op    0 allocs/op

As a result, we see that the method that takes the object by pointer is slower and also allocates on the heap. The reason is escape analysis, which here is guided by a fairly simple rule: if more than one goroutine has access to a variable, the variable must be sent to the heap.

You will object that there were no goroutines here. And I'll confuse you even more: if we work with the type directly, without the interface, there is no allocation.

Why is the rule the same, but the behavior is different? Because this rule has a small continuation:

If more than one goroutine has access to a variable – or if it cannot be proven that only one goroutine can access it – the variable must be sent to the heap.

The problem is that escape analysis does not look inside our interface implementation. It only sees an interface and a call, and it has to decide whether to send the variable to the heap. With a pointer, the variable may be modified later by other calls, so it has to escape. If we pass the argument by value, a shallow copy is made every time, so nothing needs to escape.
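
If you want to see these decisions for your own code, the compiler can print its escape analysis. A minimal check (the flag is real; the package path is whatever you are building) looks like this:

go build -gcflags='-m' ./...

Lines such as "escapes to heap" or "moved to heap" in its output show exactly which variables escape analysis gave up on.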

Let me note when interfaces do cause allocation. The most important thing to remember is that converting any non-trivial type to an interface is always an allocation, with a few exceptions (a quick allocation check is sketched right after this list):

  • pointers stored in an interface;

  • int, int32, int64, uint, uint32, uint64 – casting values up to 65 thousand;

  • very simple types (empty struct or struct with a field whose size does not exceed one machine word).
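
As a quick illustration, here is a minimal sketch (the names and the use of testing.AllocsPerRun are mine, not from the talk) that checks whether a given conversion to an interface allocates:

package main

import (
    "fmt"
    "testing"
)

type pair struct{ a, b int }

var sink any // package-level, so anything stored here certainly escapes

func main() {
    p := pair{a: 1, b: 2}

    // A multi-word struct has to be boxed when it becomes an interface value,
    // so this typically reports 1 allocation per run.
    structAllocs := testing.AllocsPerRun(1000, func() { sink = p })

    // A pointer already fits into the interface's data word,
    // so storing it adds no allocation per run.
    ptrAllocs := testing.AllocsPerRun(1000, func() { sink = &p })

    // Small integer values are served from a cache in the runtime,
    // so this usually reports 0 allocations as well.
    intAllocs := testing.AllocsPerRun(1000, func() { sink = 42 })

    fmt.Println(structAllocs, ptrAllocs, intAllocs)
}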

Therefore, if possible, avoid using interfaces. Forget they exist in Go if you want to squeeze out the maximum performance.

Here is another example, a fairly common pattern that I see in various Go applications: an interface and several of its implementations in one package. In our case these are fileDumper and dbDumper. The code is lightweight, well tested and extensible. But there are a number of problems.

Let's imagine that we have a main where we create each Dumper and call makeDump, which simply calls a method on the interface.
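
The interface version is not shown in the original listing, so here is a minimal sketch of what such code typically looks like (the method signature is inferred from the Dump method of the switch-based version shown below):

type Dumper interface {
    Dump(s string)
}

type fileDumper struct{} // would hold a file handle, path, etc.

func (fileDumper) Dump(s string) { /* write s to a file */ }

type dbDumper struct{} // would hold a database connection

func (dbDumper) Dump(s string) { /* write s to the database */ }

func makeDump(d Dumper, s string) {
    d.Dump(s) // virtual call through the interface table
}

func main() {
    for _, d := range []Dumper{fileDumper{}, dbDumper{}} {
        makeDump(d, "payload")
    }
}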

If we compile this code to assembler, we will see something called go:itab.

Let me remind you that an interface is a type consisting of two words: a pointer to the data and a pointer to the type. The problem with go:itab is that a virtual call is always a table lookup, and while we are looking, no useful work is being done.

I suggest making an enum out of the implementations we need and leaving only one type.

type dumpKind int8

const (
    _ dumpKind = iota
    toFile
    toDB
)

type Dumper struct {
    kind dumpKind
}

func (d Dumper) Dump(s string) {
    switch d.kind {
    case toFile:
        // ...
    case toDB:
        // ...
    }
}

We no longer have interfaces; inside Dump we select one implementation or another with a switch.

At the same time, this code does not allocate, unlike the previous one.

BenchmarkInterfaceDumper-8 28651759  41.81 ns/op 48 B/op 2 allocs/op 

BenchmarkStructDumper-8    164893970 7.111 ns/op 0 B/op 0 allocs/op

But there is a price for performance:

  • unsightly and hard to extend if you don't know why it was written this way;

  • difficult to test.

Some stack tricks

An important thesis: in Go, objects up to 64 KB will be allocated on the stack. But remember escape analysis, which can send a variable to the heap by its own rules, so to keep an object on the stack (a small sketch follows this list):

  • do not use it in another goroutine;

  • do not return it from the function;

  • do not pass it into an interface.
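
Here is a small sketch of the first two rules (the comments describe what escape analysis typically decides, not guaranteed compiler output):

func keepOnStack() int {
    v := make([]byte, 128) // not returned, not shared: stays on the stack
    return len(v)
}

func escapesViaReturn() *int {
    x := 42
    return &x // the pointer outlives the function: x moves to the heap
}

func escapesViaGoroutine() {
    buf := make([]byte, 128)
    go func() {
        _ = buf // visible to another goroutine: buf moves to the heap
    }()
}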

Let's imagine code where the function foo allocates two slices of 64 KB each.

func foo() {
    a := make([]byte, 64*1024)
    b := make([]byte, 64*1024)
    _, _ = a, b
}

And in the function bar we do the same thing, only with slices one byte larger.

func bar() {
    a := make([]byte, 64*1024+1)
    b := make([]byte, 64*1024+1)
    _, _ = a, b
}

Seems like what could go wrong? But anything can go wrong.

BenchmarkFoo-8  1000000000   0.3189 ns/op        0 B/op 0 allocs/op

BenchmarkBar-8       73376 16924 ns/op   147456 B/op 2 allocs/op

The function foo runs orders of magnitude faster than bar and makes zero allocations.

A separate note is needed here: the more memory you allocate on the stack, the bigger the function's frame gets, which makes stack reallocation more expensive.

Okay, 64 KB fits on the stack, but what if we need to allocate larger slices, strictly on the stack? Do we have to create several 64 KB slices and juggle indexes? No. There is an interesting hack in Go that lets you instantiate a slice in place by setting a specific index to some value – and this does not allocate.

func baz() {
    a := []byte{1_000_000: 0}
    _ = a
}

BenchmarkBaz-8   3636    302480 ns/op   0 B/op   0 allocs/op

For me personally, this is a gray area. There is an issue on this topic. On the one hand, the ticket is already 3.5 years old and nothing has happened to it; on the other hand, the comments show there are still open questions.

Let's talk about the CPU

BCE – Bounds Check Elimination

Go positions itself as a safe language: it protects us from going out of the bounds of an array or slice with bounds checks, and BCE (Bounds Check Elimination) is the compiler optimization that removes the redundant ones. But how much of a performance impact does this safety have?

Imagine the simplest possible code that turns a []byte into a uint64 (taken from the standard library).

func toUint64(b []byte) uint64 {
    return uint64(b[0]) |
        uint64(b[1])<<8 |
        uint64(b[2])<<16 |
        uint64(b[3])<<24 |
        uint64(b[4])<<32 |
        uint64(b[5])<<40 |
        uint64(b[6])<<48 |
        uint64(b[7])<<56
}

The code takes elements by index and performs a couple of bit operations on each. It seems this function should run as fast as possible. But if we run the benchmark, we see that it takes as much as 3 nanoseconds to execute.

func BenchmarkBCE(b *testing.B) {
    buf := make([]byte, 8)
    var ret uint64
    for i := 0; i < b.N; i++ {
        ret = toUint64(buf)
    }
    _ = ret
}

BenchmarkBCE-4    342406828    3.361 ns/op

A function of this kind can be executed thousands and millions of times, so you want to make it as fast as possible.

To understand why it takes so much time, you need to go into assembler.

main_toUint64_pc91:
MOVL    $7, AX
MOVQ    AX, CX
PCDATA  $1, $1
CALL    runtime.panicIndex(SB)
main_toUint64_pc104:
MOVL    $6, AX
MOVQ    AX, CX
CALL    runtime.panicIndex(SB)
main_toUint64_pc117:
MOVL    $5, AX
MOVQ    AX, CX
NOP
CALL    runtime.panicIndex(SB)
main_toUint64_pc133:
MOVL    $4, AX
MOVQ    AX, CX
CALL    runtime.panicIndex(SB)
main_toUint64_pc146:
MOVL    $3, AX
MOVQ    AX, CX
CALL    runtime.panicIndex(SB)
main_toUint64_pc159:
MOVL    $2, AX
MOVQ    AX, CX
CALL    runtime.panicIndex(SB)
main_toUint64_pc172:
MOVL    $1, AX
MOVQ    AX, CX
CALL    runtime.panicIndex(SB)
main_toUint64_pc185:
XORL    AX, AX
MOVQ    AX, CX
NOP
CALL    runtime.panicIndex(SB)
XCHGL   AX, AX

We see that each access by index actually turns into an if that checks for an out-of-bounds access. If the check fails, we panic; otherwise we jump and continue execution.

How can we get rid of unnecessary checks? What if we know that our slice already has the right number of elements? It's simple – just touch the furthest element.

It is obvious (to both humans and Go) that if there is a furthest element (in this case, index 7), then so are all the others. Let's add just one line to our original function.

func toUint64(b []byte) uint64 {
    _ = b[7]
    // ...
}

Then we run the same benchmark again and see a 2-fold acceleration.

BenchmarkBCE-4    715002706    1.657 ns/op

Loops

In Go we have two kinds of loops: the classic for loop and for-range. I'll show how much you pay every time you use for-range, because this loop is copy-based: every time you take the next element of the slice during iteration, it is shallow-copied.

The test is simple: an object with a single int field and two functions that walk a slice of such objects, summing the value of that field.

type Obj struct {
    index int
}

func sumRange(objects []Obj) int {
    ret := 0
    for _, v := range objects {
        ret += v.index
    }
    return ret
}

func sumLoop(objects []Obj) int {
    ret := 0
    for i := 0; i < len(objects); i++ {
        ret += objects[i].index
    }
    return ret
}

If we run the benchmark (it's trivial, so I won't show the code), we see that the function using the ordinary index-based for loop works almost 4 times faster than the for-range version. The fatter the object inside the slice, the bigger this difference becomes.

BenchmarkForRange-4     443161     2371 ns/op

BenchmarkForLoop-4   1863501     641.7 ns/op

What else is important not to forget?

  • The compiler does not vectorize operations (you have to write it in assembler). The code from the example about loops should in fact lend itself well to vectorization, but alas, Go currently cannot do this. Therefore, if you want to use vector instructions, you will have to write it yourself in assembler 🙂

  • CPU cache lines. Don't forget about them. A virtual call through an interface is also a problem for the cache, because each time it flushes instructions the CPU has already fetched for execution. Another peculiarity of Go is that we have goroutines rather than threads, so some techniques, such as pinning a particular goroutine to a particular CPU core, will not work (even though this is a common pattern in network applications).

Strings

When we talk about performance and the topic of strings comes up, unsafe immediately enters the picture.

Unsafe

unsafe in Go has a number of problems. The first is that converting from string to []byte and back this way gives you an unsafe string: any change to the original []byte will be reflected in the string you received.

But starting with Go 1.20 you don't have to write your own functions to convert between []byte and string, because unsafe.StringData and unsafe.String appeared in the standard library. This is important to know, because on Stack Overflow and similar sites you often come across answers where such functions are implemented incorrectly.
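
A minimal sketch of the standard-library way to do these zero-copy conversions (Go 1.20+); remember that the result shares memory with the input, so the bytes must not be modified afterwards:

import "unsafe"

// bytesToString reinterprets b as a string without copying.
// The caller must guarantee b is never modified after this call.
func bytesToString(b []byte) string {
    return unsafe.String(unsafe.SliceData(b), len(b))
}

// stringToBytes reinterprets s as a []byte without copying.
// The resulting slice must be treated as read-only.
func stringToBytes(s string) []byte {
    return unsafe.Slice(unsafe.StringData(s), len(s))
}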

What are the optimizations over strings in Go?

  • Converting a string in a for-range loop does not cause allocations.

    s := "GolangConf"
    for _, b := range []byte(s) { // 0 allocations
        _ = b // ...
    }

    When you iterate over a string with for-range, converting it to []byte or []rune on the fly, this does not allocate.

  • Comparing byte slices converted to strings.

    s1 := []byte{0: 65, 1024: 90}
    s2 := []byte{0: 65, 1024: 90}
    if string(s1) == string(s2) { // 0 allocations
        // ...
    }

    If you want to compare two []byte values as strings (let me remind you that they cannot be compared with == directly), this also does not allocate. You don't need any unsafe here 🙂

  • Looking up in a map.

    var m map[string]any
    key := []byte{...}
    value := m[string(key)] // 0 allocations

    Interestingly, if the map key type is string and what we have on hand is a []byte, then converting the byte slice to a string for the lookup does not allocate.

  • When concatenating.

    There is a huge layer of optimizations around concatenation. For example, we have two byte slices, foo and bar. On line a we concatenate them with a non-empty string, and on line b we do the same thing, only with an empty string.

    foo := []byte{100: 65}
    bar := []byte{100: 65}
    a := "x" + string(foo) + string(bar) // non-empty prefix
    b := "" + string(foo) + string(bar)  // empty prefix

    It turns out that creating string a costs only one allocation, while string b costs as many as three.

    To understand why this is so, you need to look at the assembler again.

    On the left is the concatenation with a non-empty string, on the right with an empty one. Concatenation with a non-empty string essentially boils down to computing the aligned address of all the arguments and calling a runtime function. With an empty string, we first have to convert foo and bar to strings, and only then call that same function.

    The point is that Go, knowing it already has a source string with non-zero length, can allocate all the memory for the result right away; it does not have to allocate first for foo, then for bar, and then for the whole resulting string.

    At the same time, if we go into the runtime package where this concatenation function is implemented, we are greeted by a curious detail (highlighted in the picture below with a red block): there is always a temporary buffer of 32 bytes.

    How can we use this? Concatenate strings in chunks of 32 bytes – then there is no need to play games with empty and non-empty prefixes (a minimal sketch of this shape is shown at the end of this item).

    If we run the benchmark, we see that concatenation in 32-byte chunks also makes just one allocation for the entire resulting string.

    BenchmarkSimpleConcat-8    14532418    72.20 ns/op    512 B/op 3 allocs/op

    BenchmarkChunkConcat-8     16568988    64.24 ns/op    256 B/op 1 allocs/op

    This knowledge has another implication: any benchmarks you write must use data volumes close to what you will have in production. Otherwise you may not see the allocations that actually happen there.

    The only problem with this hack is that it is really only applicable with code generation, because if you write the same thing with a loop, it will perform much worse.
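
    To make the shape of the trick concrete, here is a minimal sketch (the sizes and names are mine); per the caveat above, it only pays off as a flat expression of the kind a code generator would emit:

    // chunkConcat64 assumes b holds exactly 64 bytes and builds the result
    // from 32-byte pieces in a single concatenation expression.
    func chunkConcat64(prefix string, b []byte) string {
        _ = b[63] // bounds-check hint, as in the BCE section
        return prefix + string(b[0:32]) + string(b[32:64])
    }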

String slice sorting

A ticket recently appeared on GitHub. It points out that in current Go, sort.Strings, which works with strings directly, is about 30% faster than the generic slices.Sort. This is because sort.Strings performs only one comparison instead of three. At the moment you cannot optimally sort a list of objects by a string field in Go.
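
If you want to check the difference on your own data, a minimal benchmark pair could look like this (the names and the synthetic data are mine; the relative numbers depend on the Go version):

import (
    "math/rand"
    "slices"
    "sort"
    "strconv"
    "testing"
)

func randomStrings(n int) []string {
    s := make([]string, n)
    for i := range s {
        s[i] = strconv.Itoa(rand.Int())
    }
    return s
}

func BenchmarkSortStrings(b *testing.B) {
    src := randomStrings(10_000)
    data := make([]string, len(src))
    for i := 0; i < b.N; i++ {
        copy(data, src)
        sort.Strings(data)
    }
}

func BenchmarkSlicesSort(b *testing.B) {
    src := randomStrings(10_000)
    data := make([]string, len(src))
    for i := 0; i < b.N; i++ {
        copy(data, src)
        slices.Sort(data)
    }
}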

What's the bottom line?

  • I explained what small-size objects are and how they allow the compiler to produce more optimal code.

  • I really hope I've convinced you that interfaces are evil: if you want to get the most out of Go, you need to forget about them.

  • I showed how much of a performance impact a simple array out-of-bounds check can have.

  • I demonstrated how much the choice of loop affects performance, and that working with strings in Go is well optimized.

I hope my experience will help you get the most out of Go – and make your product even cooler and faster.

The article is based on a talk at the Saint HighLoad++ professional conference. A recording of the talk is available here.
