How to make a Go program 42% faster by changing one character

If you read the title and thought “well, you must have done something stupid first”, then you are right! But what is programming if not an exercise in stupid mistakes? Finding stupid mistakes is the greatest pleasure!

It’s also worth making a note about benchmarking in advance: a 42% speedup was measured when running the program with my data and on my computer, so take this result with a grain of salt.

What does the program do?

codeowners

is a Go program that lists the owners of each file in a repository according to a set of rules specified in

файле GitHub CODEOWNERS

. The rule might say that all files with the extension

.go

owns team

@gophers

or that all the files in the folder

docs/

owns team

@docs

.

When parsing the specified path, the last rule that matches wins. A simple but naive matching algorithm iteratively traverses the rules for each path, stopping when a match is found. There are smarter algorithms, but this is a topic for a separate article. Here is what the function looks like

Ruleset.Match

:

type Ruleset []Rule

func (r Ruleset) Match(path string) (*Rule, error) {
  for i := len(r) - 1; i >= 0; i-- {
    rule := r[i]
    match, err := rule.Match(path)
    if match || err != nil {
      return &rule, err
    }
  }
  return nil, nil
}

Finding slow fragments with pprof and flamegraph

When processing a moderately large repository, my tool was a bit slow:

$ hyperfine codeowners
Benchmark 1: codeowners
  Time (mean ± σ):      4.221 s ±  0.089 s    [User: 5.862 s, System: 0.248 s]
  Range (min … max):    4.141 s …  4.358 s    10 runs

To understand what parts of the program spends its time, I recorded a CPU profile using pprof. You can generate a CPU profile by adding to the beginning of the function

main

the following code snippet:

pprofFile, pprofErr := os.Create("cpu.pprof")
if pprofErr != nil {
  log.Fatal(pprofErr)
}
pprof.StartCPUProfile(pprofFile)
defer pprof.StopCPUProfile()

Note: I use pprof quite a lot, so I saved this code as vscode snippet. I just enter pprofpress tab, and this snippet appears.

Go has a handy interactive profile visualization tool. I rendered the profile as a flamegraph by running the following command and then switching to flamegraph mode from the menu at the top of the page.

$ go tool pprof -http=":8000" ./codeowners ./cpu.pprof

As I expected, the program spent most of its time on the function

Match

. CODEOWNERS patterns compile to regular expressions, and most of the time the functions

Match

was spent on the regex engine of the Go language. But I also noticed that a lot of time was spent allocating and returning memory. The purple blocks in the flamegraph shown below match the pattern

gc|malloc

and it is clear that in total they make up a significant part of the program execution time.

Hunting Heap Allocations with Escape Analysis Traces

Let’s see if there are any allocations we can get rid of to reduce the GC pressure and reduce the time it takes

malloc

.

To figure out when some memory should continue to live on the heap, the Go compiler uses a technique called escape analysis. Let’s say a function initializes a struct and then returns a pointer to it. If a struct is allocated on the stack, the returned pointer will become invalid immediately after the function returns and the corresponding stack frame is invalidated. In this case, the Go compiler will determine that the pointer has “escaped” from the function and move the struct onto the heap.

You can see how these decisions are made by passing -gcflags=-m team go build:

$ go build -gcflags=-m *.go 2>&1 | grep codeowners.go
./codeowners.go:82:18: inlining call to os.IsNotExist
./codeowners.go:71:28: inlining call to filepath.Join
./codeowners.go:52:19: inlining call to os.Open
./codeowners.go:131:6: can inline Rule.Match
./codeowners.go:108:27: inlining call to Rule.Match
./codeowners.go:126:6: can inline Rule.RawPattern
./codeowners.go:155:6: can inline Owner.String
./codeowners.go:92:29: ... argument does not escape
./codeowners.go:96:33: string(output) escapes to heap
./codeowners.go:80:17: leaking param: path
./codeowners.go:70:31: []string{...} does not escape
./codeowners.go:71:28: ... argument does not escape
./codeowners.go:51:15: leaking param: path
./codeowners.go:105:7: leaking param content: r
./codeowners.go:105:24: leaking param: path
./codeowners.go:107:3: moved to heap: rule <<< --------- перемещение в кучу
./codeowners.go:126:7: leaking param: r to result ~r0 level=0
./codeowners.go:131:7: leaking param: r
./codeowners.go:131:21: leaking param: path
./codeowners.go:155:7: leaking param: o to result ~r0 level=0
./codeowners.go:159:13: "@" + o.Value escapes to heap

The output looks a bit noisy, but most of it can be ignored. Since we are looking for distributions, we should be concerned about the phrase

moved to heap

. Looking at the above code

Match

one can see that the structures

Rule

stored in a slice

Ruleset

, which we can be sure is already on the heap. And since a pointer to rule is returned, no additional allocations should be required.

And then I realized – assigning rule := r[i]we copy distributed on the heap Rule from the slice to the stack, and returning &rule, we create a pointer (“escaping”) to a copy of struct. Luckily, fixing this is easy. We just need to move the ampersand a little higher so that we get a reference to the struct in the slice instead of copying it:

 func (r Ruleset) Match(path string) (*Rule, error) {
 	for i := len(r) - 1; i >= 0; i-- {
		rule := &r[i] // вместо rule := r[i]
 		match, err := rule.Match(path)
 		if match || err != nil {
			return rule, err // вместо return &rule, err
 		}
 	}
 	return nil, nil
 }

I have considered two other approaches:

  1. transformation Ruleset from []Rule in []*Rulewhich would mean that we no longer need to explicitly obtain a reference to the rule.
  2. Return Rule instead of *Rule. It will still copy Rulebut it should stay on the stack, not be moved to the heap.

However, both of these approaches would lead to a breaking change since this method is part of the public API.

Be that as it may, after making this change, we can check if it had the desired effect by getting a new traceback from the compiler and comparing it with the old one:

$ diff trace-a trace-b
14a15
> ./codeowners.go:105:7: leaking param: r to result ~r0 level=0
16d16
< ./codeowners.go:107:3: moved to heap: rule

Success! Distribution is gone. Now let’s see how the removal of this one heap allocation affected performance:

$ hyperfine ./codeowners-a ./codeowners-b
Benchmark 1: ./codeowners-a
  Time (mean ± σ):      4.146 s ±  0.003 s    [User: 5.809 s, System: 0.249 s]
  Range (min … max):    4.139 s …  4.149 s    10 runs

Benchmark 2: ./codeowners-b
  Time (mean ± σ):      2.435 s ±  0.029 s    [User: 2.424 s, System: 0.026 s]
  Range (min … max):    2.413 s …  2.516 s    10 runs

Summary
  ./codeowners-b ran
    1.70 ± 0.02 times faster than ./codeowners-a

Since this distribution took place for

each matched path

, its removal gave in my case an increase in speed by 1.7 times (that is, the program started working 42% faster). Not a bad result for a single character change.

Similar Posts

Leave a Reply