How to make a Go program 42% faster by changing one character

If you read the title and thought “well, you must have done something stupid first”, then you are right! But what is programming if not an exercise in stupid mistakes? Finding stupid mistakes is the greatest pleasure!
A note about benchmarking up front: the 42% speedup was measured running the program with my data and on my computer, so take this result with a grain of salt.
What does the program do?
codeowners is a Go program that lists the owners of each file in a repository according to a set of rules specified in a CODEOWNERS file. A rule might say that all files with the .go extension are owned by the team @gophers, or that all files in the docs/ folder are owned by the team @docs.
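For example, a CODEOWNERS file encoding those two rules might look like this (an illustrative sketch, not a file from any particular repository):

*.go    @gophers
docs/   @docs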
When matching a path against the rules, the last rule that matches wins. A simple but naive matching algorithm iterates over the rules for each path from last to first, stopping when a match is found. There are smarter algorithms, but that is a topic for a separate article. Here is what the Ruleset.Match function looks like:
type Ruleset []Rule

func (r Ruleset) Match(path string) (*Rule, error) {
	for i := len(r) - 1; i >= 0; i-- {
		rule := r[i]
		match, err := rule.Match(path)
		if match || err != nil {
			return &rule, err
		}
	}
	return nil, nil
}
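For context, the article does not show the Rule and Owner types themselves. Here is a rough sketch of what they might look like; the field names are my guesses, based only on details mentioned later in the article (patterns compile to regular expressions, and owners render as "@" plus a value):

package codeowners

import "regexp"

// Hypothetical sketch; the real types in the codeowners package may differ.
type Owner struct {
	Value string // e.g. "gophers" for the owner @gophers
}

func (o Owner) String() string { return "@" + o.Value }

type Rule struct {
	pattern *regexp.Regexp // the CODEOWNERS pattern, compiled to a regular expression
	Owners  []Owner        // the owners this rule assigns
}

// Match reports whether the rule's pattern matches the given path.
func (r Rule) Match(path string) (bool, error) {
	return r.pattern.MatchString(path), nil
}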
Finding the slow parts with pprof and flamegraphs
When processing a moderately large repository, my tool was a bit slow:
$ hyperfine codeowners
Benchmark 1: codeowners
Time (mean ± σ): 4.221 s ± 0.089 s [User: 5.862 s, System: 0.248 s]
Range (min … max): 4.141 s … 4.358 s 10 runs
To understand where the program was spending its time, I recorded a CPU profile using pprof. You can generate one by adding the following snippet to the beginning of the main function:
pprofFile, pprofErr := os.Create("cpu.pprof")
if pprofErr != nil {
log.Fatal(pprofErr)
}
pprof.StartCPUProfile(pprofFile)
defer pprof.StopCPUProfile()
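For completeness, the snippet relies on the log, os, and runtime/pprof packages; in context it might sit at the top of main like this (the rest of the program is elided):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write the CPU profile to cpu.pprof in the current directory.
	pprofFile, pprofErr := os.Create("cpu.pprof")
	if pprofErr != nil {
		log.Fatal(pprofErr)
	}
	pprof.StartCPUProfile(pprofFile)
	defer pprof.StopCPUProfile()

	// ... the rest of the program ...
}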
Note: I use pprof quite a lot, so I saved this code as a VS Code snippet. I just type pprof, press Tab, and the snippet appears.
Go has a handy interactive profile visualization tool. I rendered the profile as a flamegraph by running the following command and then switching to flamegraph mode from the menu at the top of the page.
$ go tool pprof -http=":8000" ./codeowners ./cpu.pprof
As I expected, the program spent most of its time in the Match function. CODEOWNERS patterns are compiled to regular expressions, and most of the time in Match was spent in Go's regex engine. But I also noticed that a lot of time was spent allocating and freeing memory. The purple blocks in the flamegraph shown below match the pattern gc|malloc, and it is clear that together they account for a significant share of the program's execution time.

Hunting Heap Allocations with Escape Analysis Traces
Let's see if there are any allocations we can get rid of to reduce GC pressure and the time spent in malloc.
To figure out whether a value needs to live on the heap, the Go compiler uses a technique called escape analysis. Say a function initializes a struct and then returns a pointer to it. If the struct were allocated on the stack, the returned pointer would become invalid the moment the function returns and its stack frame is reclaimed. In that case the Go compiler determines that the pointer has "escaped" the function and moves the struct onto the heap.
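As a toy illustration (my own example, not from the codeowners code): in the sketch below, newOnHeap returns a pointer to a local struct, so the compiler reports something like "moved to heap: p" for it, while sumOnStack lets the struct live and die on the stack.

package escape

type point struct{ x, y int }

// The returned pointer outlives the call, so escape analysis moves p
// to the heap ("moved to heap: p" in the -gcflags=-m output).
func newOnHeap(x, y int) *point {
	p := point{x, y}
	return &p
}

// Nothing escapes here: p can live entirely within the stack frame.
func sumOnStack(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}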
You can see how these decisions are made by passing -gcflags=-m to go build:
$ go build -gcflags=-m *.go 2>&1 | grep codeowners.go
./codeowners.go:82:18: inlining call to os.IsNotExist
./codeowners.go:71:28: inlining call to filepath.Join
./codeowners.go:52:19: inlining call to os.Open
./codeowners.go:131:6: can inline Rule.Match
./codeowners.go:108:27: inlining call to Rule.Match
./codeowners.go:126:6: can inline Rule.RawPattern
./codeowners.go:155:6: can inline Owner.String
./codeowners.go:92:29: ... argument does not escape
./codeowners.go:96:33: string(output) escapes to heap
./codeowners.go:80:17: leaking param: path
./codeowners.go:70:31: []string{...} does not escape
./codeowners.go:71:28: ... argument does not escape
./codeowners.go:51:15: leaking param: path
./codeowners.go:105:7: leaking param content: r
./codeowners.go:105:24: leaking param: path
./codeowners.go:107:3: moved to heap: rule <<< --------- moved to the heap
./codeowners.go:126:7: leaking param: r to result ~r0 level=0
./codeowners.go:131:7: leaking param: r
./codeowners.go:131:21: leaking param: path
./codeowners.go:155:7: leaking param: o to result ~r0 level=0
./codeowners.go:159:13: "@" + o.Value escapes to heap
The output looks a bit noisy, but most of it can be ignored. Since we are hunting for allocations, the phrase to watch for is moved to heap. Looking at the Match code above, one can see that the Rule structs are stored in the Ruleset slice, which we can be sure is already on the heap. And since a pointer to the rule is returned, no additional allocation should be required.
And then I realized: by assigning rule := r[i] we copy the heap-allocated Rule out of the slice onto the stack, and by returning &rule we create an escaping pointer to that stack copy of the struct. Luckily, the fix is easy. We just need to move the ampersand a couple of lines up so that we take a pointer to the struct inside the slice instead of copying it:
func (r Ruleset) Match(path string) (*Rule, error) {
	for i := len(r) - 1; i >= 0; i-- {
		rule := &r[i] // instead of rule := r[i]
		match, err := rule.Match(path)
		if match || err != nil {
			return rule, err // instead of return &rule, err
		}
	}
	return nil, nil
}
I considered two other approaches (both sketched below):
- Converting Ruleset from []Rule to []*Rule, which would mean we no longer need to take a reference to the rule explicitly.
- Returning Rule instead of *Rule. This would still copy the Rule, but the copy should stay on the stack rather than being moved to the heap.
However, both of these approaches would lead to a breaking change since this method is part of the public API.
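For illustration, here is roughly what each alternative would look like. The names PtrRuleset and MatchValue are placeholders of mine so that both variants can sit side by side; in the real API each would replace the existing Ruleset.Match, which is exactly why they break compatibility.

// Alternative 1: store pointers in the slice, so indexing already yields a *Rule.
type PtrRuleset []*Rule

func (r PtrRuleset) Match(path string) (*Rule, error) {
	for i := len(r) - 1; i >= 0; i-- {
		match, err := r[i].Match(path)
		if match || err != nil {
			return r[i], err
		}
	}
	return nil, nil
}

// Alternative 2: return the Rule by value. The copy stays on the caller's stack,
// but there is no nil to return, so an extra bool has to signal "no match".
func (r Ruleset) MatchValue(path string) (Rule, bool, error) {
	for i := len(r) - 1; i >= 0; i-- {
		match, err := r[i].Match(path)
		if match || err != nil {
			return r[i], true, err
		}
	}
	return Rule{}, false, nil
}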
Be that as it may, after making this change we can check whether it had the desired effect by generating a new escape analysis trace from the compiler and comparing it with the old one:
$ diff trace-a trace-b
14a15
> ./codeowners.go:105:7: leaking param: r to result ~r0 level=0
16d16
< ./codeowners.go:107:3: moved to heap: rule
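Here trace-a and trace-b are simply the escape analysis output from before and after the change, saved to files with something along the lines of:

$ go build -gcflags=-m *.go 2>&1 | grep codeowners.go > trace-b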
Success! The allocation is gone. Now let's see how removing this one heap allocation affected performance:
$ hyperfine ./codeowners-a ./codeowners-b
Benchmark 1: ./codeowners-a
Time (mean ± σ): 4.146 s ± 0.003 s [User: 5.809 s, System: 0.249 s]
Range (min … max): 4.139 s … 4.149 s 10 runs
Benchmark 2: ./codeowners-b
Time (mean ± σ): 2.435 s ± 0.029 s [User: 2.424 s, System: 0.026 s]
Range (min … max): 2.413 s … 2.516 s 10 runs
Summary
./codeowners-b ran
1.70 ± 0.02 times faster than ./codeowners-a
Since this allocation happened for each matched path, removing it made the program 1.7 times faster in my case (in other words, the runtime dropped by roughly 42%). Not a bad result for a one-character change.
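If you want to see the allocation itself rather than the end-to-end timing, a Go benchmark with b.ReportAllocs() is a lightweight way to do it. A minimal sketch, assuming you can build a small Ruleset in the test (the rules themselves are elided here):

package codeowners

import "testing"

func BenchmarkRulesetMatch(b *testing.B) {
	rules := Ruleset{ /* populate with a few Rule values */ }
	b.ReportAllocs() // allocs/op makes the extra per-match allocation visible before the fix
	for i := 0; i < b.N; i++ {
		if _, err := rules.Match("docs/README.md"); err != nil {
			b.Fatal(err)
		}
	}
}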