Analyzing bounds checks in Go with CPU profiles
Today we will analyze binaries together with their CPU profiles in order to build augmented execution profiles. We can use these augmented profiles to estimate how much time a program spends on array and slice bounds checks.
Introduction to the problem
Profiling in Go collects call stacks along with the position of the currently executing instruction. This granularity is very useful in the disassembly view, but it tells us little about the semantics of the instructions being executed.
It’s easiest to work at the function level: per function, we can aggregate samples, build call trees and flame graphs.
Some operations built into the language compile down to a recognizable call. For example, the channel operators become runtime.chanrecv and runtime.chansend.
Others can turn into multiple calls: append(b1, b2...) may emit calls to runtime.growslice and runtime.memmove. This already complicates the analysis of profiles, but at least it gives us something to hold on to.
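For instance, here is a minimal sketch (the exact codegen depends on the compiler version; go build -gcflags=-S is the easiest way to check):

// A toy function: compiling the package with `go build -gcflags=-S`
// shows a CALL to runtime.growslice (when b1 must grow) and a memmove
// for copying b2; exact output varies between compiler versions.
func appendBytes(b1, b2 []byte) []byte {
	return append(b1, b2...)
}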
The hardest case is operations that embed their code right where they are used, without any call. This is efficient in terms of execution speed, but profiling such constructs with standard tools becomes impossible: their machine code is mixed into the function that uses them, and in the profile their samples simply inflate that function’s flat (self) time.
The price of bounds checks
Let’s benchmark a fairly simple function for getting the last element of a slice:
package bcbench

import "testing"

//go:noinline
func getlast(xs []int) int {
	return xs[len(xs)-1]
}

func BenchmarkGetLast(b *testing.B) {
	xs := []int{0, 2, 3, 5}
	for i := 0; i < b.N; i++ {
		getlast(xs)
		getlast(xs)
		getlast(xs)
		getlast(xs)
	}
}
To keep the test binary around after the benchmarks, we build it explicitly with -c. We will need a lot of runs (-count), since the operations under test are very fast: the sampling nature of profiling requires longer total run time so that we have a chance to capture stacks at the right moments.
$ go test -c
$ ./bcbench.test -test.count=60 -test.cpuprofile=cpu.out -test.bench=.
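Since we will compare two versions of the function with benchstat later, it is worth saving the benchmark output right away (old.txt is my name for the file; the benchstat comparison below uses 10 runs per version):

$ ./bcbench.test -test.count=10 -test.bench=. | tee old.txt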
Now open the profile in the browser (passing the binary as well, so that pprof can show disassembly):
$ go tool pprof -http=:8080 bcbench.test cpu.out
I’ve marked the lines that directly implement the slice bounds check. We will return to this structure in a moment, but first I want to draw your attention to the fact that most of the samples did not land on these instructions at all. This does not mean that the bounds check is free here. Rather, it is a sign that, collecting samples 100 times per second, we have little chance of catching every short fragment of machine code.
Let’s rewrite getlast so that the compiler inserts no checks:
//go:noinline
func getlast(xs []int) int {
	if len(xs) != 0 {
		return xs[len(xs)-1]
	}
	return 0
}
Let’s run the same benchmarks again, collect a profile, and compare performance.
We can see that there are no compiler-inserted checks. Instead of CMPQ we now have TESTQ in the code, and it may seem to be slower, since 4 whole seconds were spent on this instruction. But that is not so; we just got “lucky” with where the samples landed.
Comparing performance shows that the second version of the function is faster, even though its source is more verbose:
$ benchstat old.txt new.txt
name old time/op new time/op delta
GetLast-8 12.5ns ± 2% 9.3ns ± 1% -25.93% (p=0.000 n=10+10)
And, again, this is mostly a side effect: the version that never calls runtime.panicIndex does not need a stack frame, so the function itself is more lightweight. Our real problem is that even with all this information it is not easy to measure the time spent on the bounds check itself.
The benchmark itself could be rewritten without go:noinline, but then you need to make sure that the compiler does not figure out that our slice never changes, which would make checking its length inside the loop pointless. Writing correct benchmarks is not trivial.
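One common trick, shown here as a sketch (the sink variable and the benchmark name are mine, not from the original benchmark), is to accumulate results into a package-level variable so the compiler cannot discard the calls:

var sink int

func BenchmarkGetLastInlined(b *testing.B) {
	xs := []int{0, 2, 3, 5}
	for i := 0; i < b.N; i++ {
		// The compiler may inline getlast here; writing into the
		// sink keeps the call from being optimized away entirely.
		sink += getlast(xs)
	}
}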
Where bounds checks come from
Out-of-bounds checks come in several varieties. Let’s figure out what in our code may require them in the first place.
For arrays, whose length is part of their type, checks are inserted only for non-constant indexes.
func f(i int) {
	var a [8]int
	println(a[0]) // no check needed
	println(a[i]) // check i < len(a)
}
For slices, these checks are inserted for nearly every indexing and slicing operation. In practice, some of them are removed as redundant when the compiler can prove that the index is always within the allowed range.
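For example, a preceding length test lets the compiler drop the check entirely; you can ask which checks remain with go build -gcflags='-d=ssa/check_bce/debug=1' (a compiler debug flag whose output format varies between versions). A minimal illustration:

func get(b []byte, i int) byte {
	if i >= len(b) {
		return 0
	}
	// The bounds check for b[i] is eliminated: the branch above
	// already guarantees that i is in range.
	return b[i]
}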
Slicing an array (taking a slice of it) triggers the same kind of checks, so for a[i:] the compiler will insert a check that i <= len(a).
func f(b []byte, i, j int) {
	println(b[i])   // check i < len(b)
	println(b[i:])  // check i <= len(b)
	println(b[:j])  // check j <= cap(b)
	println(b[i:j]) // check i <= len(b) && j <= cap(b)
}
It is important to note that different operations may generate slightly different code fragments, and the results also depend on the compiler version. This article is written against Go 1.17 and 1.18.
A look at bounds checks in the disassembler
All code examples are relevant for x86-64 (GOARCH=amd64).
Let’s start with the simplest case:
func indexing(i int, xs []int) int {
	return xs[i]
}
Let’s look at the generated code:
	SUBQ $24, SP
	MOVQ BP, 16(SP)
	LEAQ 16(SP), BP
	MOVQ BX, xs+40(FP)
	CMPQ CX, AX
	JLS bc_fail
	MOVQ (BX)(AX*8), AX
	MOVQ 16(SP), BP
	ADDQ $24, SP
	RET
bc_fail:
	CALL runtime.panicIndex(SB)
This is a fairly general scheme for a bounds check:
- a comparison instruction (CMPQ, or TESTQ when comparing with 0);
- a conditional jump (the JLS above is Go assembler syntax for JBE);
- at the jump target, a call to one of the panicking functions.
Before the CALL instruction there may be a few MOV or LEA instructions that set up its arguments.
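By the way, such listings can be reproduced without a profiler via go tool objdump (the binary and symbol names below are just an example):

$ go test -c
$ go tool objdump -s indexing ./bcbench.test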
Let’s take a few more functions for experiments:
func slicingFrom(i int, xs []int) []int {
	return xs[i:]
}

func slicingTo(i int, xs []int) []int {
	return xs[:i]
}

func slicingFromTo(i, j int, xs []int) []int {
	return xs[i:j]
}
If we look at the resulting code in the disassembler, we will notice that largely the same code is generated for all of them, but the panicking functions differ:
- indexing calls runtime.panicIndex()
- slicingFrom calls runtime.panicSliceB()
- slicingTo calls runtime.panicSliceAcap()
- slicingFromTo calls runtime.panicSliceB() and runtime.panicSliceAcap()
In fact, even this is not the complete set; the full list of such functions can be found in the compiler sources. This means that more than one function can be called when a check fails, and we need to know the addresses of all of them. If the machine code matches the scheme above and ends with a call to one of the special panicking functions, such as panicSliceAcap, we can be confident that we are looking at a bounds check. We will attribute samples to only two instructions of the whole recognized sequence: the CMPQ and the JLS.
Collecting addresses of panic functions
I will consider the case of an ELF binary. The Go standard library has the debug/elf package; we will use it.
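As a quick sanity check before writing any code, you can confirm these symbols are present in the binary with go tool nm:

$ go tool nm bcbench.test | grep 'runtime.panic'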
var bcFuncNames = map[string]struct{}{
	"runtime.panicIndex":        {},
	"runtime.panicIndexU":       {},
	"runtime.panicSliceAlen":    {},
	"runtime.panicSliceAlenU":   {},
	"runtime.panicSliceAcap":    {},
	"runtime.panicSliceAcapU":   {},
	"runtime.panicSliceB":       {},
	"runtime.panicSliceBU":      {},
	"runtime.panicSlice3Alen":   {},
	"runtime.panicSlice3AlenU":  {},
	"runtime.panicSlice3Acap":   {},
	"runtime.panicSlice3AcapU":  {},
	"runtime.panicSlice3B":      {},
	"runtime.panicSlice3BU":     {},
	"runtime.panicSlice3C":      {},
	"runtime.panicSlice3CU":     {},
	"runtime.panicSliceConvert": {},
}
bcFuncAddresses := make(map[int64]struct{})

// exeBytes is the raw []byte contents of our binary
f, err := elf.NewFile(bytes.NewReader(exeBytes))
if err != nil {
	return fmt.Errorf("parse ELF: %v", err)
}
symbols, err := f.Symbols()
if err != nil {
	return fmt.Errorf("fetch ELF symbols: %v", err)
}
for _, sym := range symbols {
	if _, ok := bcFuncNames[sym.Name]; !ok {
		continue
	}
	addr := sym.Value // !!! <----------------------------
	bcFuncAddresses[int64(addr)] = struct{}{}
}
Unfortunately, this code is not quite correct; pay attention to the marked line. For these addresses to match what ends up inside the CPU profile, we need to apply a transformation. It’s time to parse the CPU profile.
The github.com/google/pprof/profile package is perfect for this task.
// Using ReadFile instead of os.Open saves us from dealing with
// a file that would have to be closed later
data, err := os.ReadFile("cpu.out")
if err != nil {
	return fmt.Errorf("read profile: %v", err)
}
p, err := profile.Parse(bytes.NewReader(data))
if err != nil {
	return fmt.Errorf("parse profile: %v", err)
}
Now let’s change the address calculations:
- addr := sym.Value
+ addr := sym.Value + p.Mapping[0].Offset - p.Mapping[0].Start
Hooray! Now we can recognize a CALL <panicfunc> by looking its target address up in this table.
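The intuition here: loc.Address values in a profile are runtime virtual addresses inside a mapping loaded at Start from file offset Offset, so applying + Offset - Start translates both symbol addresses and sample addresses into the same file-offset space, which is also what we use to index exeBytes. A minimal helper, assuming a single executable mapping (typical for ordinary non-PIE Go binaries):

// fileAddr translates a virtual address into the file-offset space
// of the binary; a sketch that assumes p.Mapping[0] covers all code.
func fileAddr(addr uint64, m *profile.Mapping) int64 {
	return int64(addr + m.Offset - m.Start)
}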
Parsing machine code
Walking over the samples, we use loc.Address to get the address of the sampled instruction. Then markBoundCheck checks whether a bounds check can be recognized at that position.
for _, sample := range p.Sample {
	if len(sample.Location) == 0 {
		continue
	}
	for _, loc := range sample.Location {
		if len(loc.Line) == 0 {
			continue
		}
		m := loc.Mapping
		addr := int64(loc.Address + m.Offset - m.Start)
		// ctx contains the bcFuncAddresses prepared earlier
		markBoundCheck(ctx, exeBytes, addr)
	}
}
To understand where these addresses come from, it helps to recall the high-level structure of a profile: each sample references a stack of locations, and each location carries the sampled instruction address plus its (possibly inlined) source lines.
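In terms of the github.com/google/pprof/profile types, the fields we rely on look roughly like this (an abridged outline, not the full definitions):

// Sample
//   Location []*Location // the call stack, leaf frame first
// Location
//   Mapping  *Mapping    // the binary this address belongs to
//   Address  uint64      // virtual address of the sampled instruction
//   Line     []Line      // inline frames, innermost first
// Line
//   Function *Function
//   Line     int64       // source line number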
It remains to implement markBoundCheck. To decode instructions, we will use golang.org/x/arch/x86/x86asm.
func markBoundCheck(ctx *context, code []byte, addr int64) bool {
	// 1. The comparison instruction
	cmp, err := x86asm.Decode(code[addr:], 64)
	if err != nil {
		return false
	}
	if cmp.Op != x86asm.CMP && cmp.Op != x86asm.TEST {
		return false
	}
	// 2. The conditional jump
	jmp, err := x86asm.Decode(code[addr+int64(cmp.Len):], 64)
	if err != nil {
		return false
	}
	jumpFrom := addr + int64(cmp.Len)
	var jumpTo int64
	switch jmp.Op {
	case x86asm.JBE, x86asm.JB, x86asm.JA:
		rel, ok := jmp.Args[0].(x86asm.Rel)
		if !ok {
			return false
		}
		jumpTo = jumpFrom + int64(rel) + int64(jmp.Len)
	default:
		return false
	}
	// 3. The call to the panicking function
	call, err := x86asm.Decode(code[jumpTo:], 64)
	if err != nil {
		return false
	}
	if call.Op != x86asm.CALL {
		return false
	}
	funcRel, ok := call.Args[0].(x86asm.Rel)
	if !ok {
		return false
	}
	funcAddr := jumpTo + int64(funcRel) + int64(call.Len)
	if _, ok := ctx.bcFuncAddresses[funcAddr]; !ok {
		return false
	}
	// Record that the CMPQ (addr) and the jump (addr+cmp.Len)
	// belong to a bounds check
	ctx.bcAddresses[addr] = struct{}{}
	ctx.bcAddresses[addr+int64(cmp.Len)] = struct{}{}
	return true
}
This code is slightly simplified; a link to the full implementation is at the end of the article.
Now, given any instruction address from the CPU profile, we can tell whether it belongs to a bounds check or not.
Adding runtime.boundcheck to the profile
The easiest way is to add new inline frames to existing profile.Location objects. But first we need to register the new function in the profile. Let’s call it runtime.boundcheck:
// IDs start at 1, i.e. p.Function[i].ID == i+1
id := len(p.Function) + 1
boundcheckFunc := &profile.Function{
	ID:         uint64(id),
	Name:       "runtime.boundcheck",
	SystemName: "runtime.boundcheck",
	Filename:   "builtins.go", // Doesn't matter
	StartLine:  1,             // Doesn't matter
}
p.Function = append(p.Function, boundcheckFunc)
Now let’s walk over all the samples again and insert runtime.boundcheck into every sample that is currently executing the recognized code.
for _, sample := range p.Sample {
	if len(sample.Location) == 0 {
		continue
	}
	loc := sample.Location[0]
	if len(loc.Line) == 0 {
		continue
	}
	m := loc.Mapping
	addr := int64(loc.Address + m.Offset - m.Start)
	if _, ok := ctx.bcAddresses[addr]; ok {
		insertLine(loc, boundcheckFunc)
	}
}
The insertLine function can be implemented like this:
func insertLine(loc *profile.Location, fn *profile.Function) {
	if loc.Line[0].Function == fn {
		return // Looks like we already inserted the new call here
	}
	newLine := profile.Line{
		Function: fn,
		Line:     loc.Line[0].Line,
	}
	// Prepending makes runtime.boundcheck the innermost (leaf) frame
	loc.Line = append([]profile.Line{newLine}, loc.Line...)
}
Once we are done with the profile objects, all that remains is to call the Write method and save the result to a new file:
f, err := os.Create("cpu2.out")
if err != nil {
	return err
}
defer f.Close()
if err := p.Write(f); err != nil {
	return err
}
Working with the new, augmented profile
Let’s try to open a new profile in pprof:
$ go tool pprof cpu2.out
Everything works. The GUI mode works too; you have already seen it in the screenshot above (the post’s cover image).
That screenshot also shows runtime.nilcheck. Adding this information to the profile is not hard, and the algorithm is exactly the same; only the marking phase differs, where we match the native code. Most often, a nil check in Go looks like a single instruction: TESTB AX, (reg).
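A marking function for nil checks could then look like this sketch (markNilCheck and ctx.nilcheckAddresses are hypothetical names of mine; the operand checks are simplified):

func markNilCheck(ctx *context, code []byte, addr int64) bool {
	inst, err := x86asm.Decode(code[addr:], 64)
	if err != nil {
		return false
	}
	// TESTB AX, (reg) decodes as TEST with a memory operand and AL.
	if inst.Op != x86asm.TEST {
		return false
	}
	if _, ok := inst.Args[0].(x86asm.Mem); !ok {
		return false
	}
	if inst.Args[1] != x86asm.AL {
		return false
	}
	ctx.nilcheckAddresses[addr] = struct{}{}
	return true
}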
Conclusion
Adding the runtime.nilcheck and runtime.boundcheck pseudo-functions to a profile is implemented in qpprof. The full versions of the sources from the examples can be found there as well.
At this stage I cannot claim that we get precise numbers, but at least we have learned to augment profiles with new information. If you use an alternative Go profiler with a higher sampling frequency, or one that is not sampling-based at all, the chances of getting accurate metrics increase. If such profilers can export data in the pprof format (profile.proto), all the tricks from this article remain applicable; otherwise, we can try to convert the profile into Go’s native format.
I have another secret use for such augmented profiles, but that’s a topic for another time.