We speed up hugo by 20% with a simple change in the reflect package

Finding a significant performance bottleneck in a standard library or mature application is rare.

I was surprised when in the top10 CPU profile list hugo during assembly digitalgov.gov the first position was the method reflect.Type.MethodByName().

      flat  flat%   sum%        cum   cum%
     8.84s  6.28%  6.28%     57.85s 41.10%  reflect.(*rtype).MethodByName
     7.93s  5.63% 11.92%      8.50s  6.04%  reflect.name.readVarint
     7.56s  5.37% 17.29%    111.79s 79.43%  reflect.Value.call
     7.53s  5.35% 22.64%     23.33s 16.58%  runtime.mallocgc
     7.29s  5.18% 27.82%     16.10s 11.44%  reflect.name.name

In this article I will tell you about how it happened and what could be done about it.

Why is MethodByName so slow

Let’s open the sources of MethodByName:

func (t *rtype) MethodByName(name string) (m Method, ok bool) {
    // <лишний код вырезан>

    // TODO(mdempsky): Binary search.
    for i, p := range ut.exportedMethods() {
        if t.nameOff(p.name).name() == name {
            return t.Method(i), true
        }
    }
    return Method{}, false
}

MethodByName uses a linear search. If the type has a lot of exported methods, then this can take a very long time, especially if the desired method is located somewhere at the end of the slice.

How can you speed up MethodByName

Option one: don’t use MethodByName. Or cache the result. But let’s assume that we can’t do that.

func (t *rtype) MethodByName(name string) (m Method, ok bool) {
    // <лишний код вырезан>

    if len(methods) > 8 {
        i := sort.Search(len(methods), func(i int) bool {
            return t.nameOff(methods[i].name).name() >= name
        })
        if i < len(methods) && t.nameOff(methods[i].name).name() == name {
            return t.Method(i), true
        }
    } else {
        for i, p := range methods {
            if t.nameOff(p.name).name() == name {
                return t.Method(i), true
            }
        }
    }
    return Method{}, false
}

We left linear search for small slices to avoid slowdowns. At larger sizes, binary search wins in almost every situation, except perhaps when the method you are looking for is at the very beginning of the slice.

If we write benchmarks, we get something like this:

name                           old time/op  new time/op  delta
MethodByName/4_first-8          426ns ± 2%   423ns ± 1%     ~   
MethodByName/4_last-8           482ns ± 1%   476ns ± 1%   -1.33%
MethodByName/4_nonexisting-8   57.8ns ± 0%  57.3ns ± 1%   -0.78%
MethodByName/16_first-8         427ns ± 1%   538ns ± 0%  +25.97%
MethodByName/16_last-8          643ns ± 1%   519ns ± 1%  -19.24%
MethodByName/16_nonexisting-8   194ns ± 0%   105ns ± 0%  -46.03%
MethodByName/32_first-8         429ns ± 1%   552ns ± 1%  +28.78%
MethodByName/32_last-8          831ns ± 0%   540ns ± 2%  -34.97%
MethodByName/32_nonexisting-8   396ns ± 1%   125ns ± 1%  -68.56%
MethodByName/64_first-8         431ns ± 1%   580ns ± 2%  +34.64%
MethodByName/64_last-8         1.36µs ± 1%  0.56µs ± 1%  -58.95%
MethodByName/64_nonexisting-8   759ns ± 0%   146ns ± 1%  -80.82%
[Geo mean]                      430ns        303ns       -29.58%

As expected, the new method performs worse for cases where the desired method is at the beginning of the slice. If we take into account that these are degenerate cases and, on average, the method will be somewhere in the middle, then the gain is quite obvious.

Checking impact on hugo

Initially, the above site was going for 20.4 seconds.

Build hugo with the updated package reflect and start building the site again.

Incredibly, building the site now takes 16.5 seconds!

And what about the profile?

(pprof) top 20
Showing nodes accounting for 50.85s, 44.61% of 114s total
Dropped 1174 nodes (cum <= 0.57s)
Showing top 20 nodes out of 194
      flat  flat%   sum%        cum   cum%
     8.64s  7.58%  7.58%     25.01s 21.94%  runtime.mallocgc
     7.69s  6.75% 14.32%     87.02s 76.33%  reflect.Value.call
     4.17s  3.66% 17.98%      5.39s  4.73%  runtime.heapBitsSetType
     2.88s  2.53% 20.51%         4s  3.51%  runtime.findObject
     2.83s  2.48% 22.99%      2.83s  2.48%  runtime.nextFreeFast (inline)
     2.79s  2.45% 25.44%      9.46s  8.30%  runtime.scanobject
     2.74s  2.40% 27.84%      2.74s  2.40%  runtime.duffcopy
     1.82s  1.60% 29.44%      1.88s  1.65%  runtime.pageIndexOf
     1.76s  1.54% 30.98%      1.76s  1.54%  runtime.memclrNoHeapPointers
     1.69s  1.48% 32.46%      3.24s  2.84%  runtime.mapaccess2
     1.66s  1.46% 33.92%      1.83s  1.61%  syscall.Syscall
     1.49s  1.31% 35.23%      1.61s  1.41%  reflect.name.readVarint
     1.43s  1.25% 36.48%      3.07s  2.69%  reflect.name.name
     1.42s  1.25% 37.73%     10.54s  9.25%  reflect.FuncOf
     1.37s  1.20% 38.93%      1.75s  1.54%  github.com/gohugoio/hugo/...
     1.34s  1.18% 40.11%      4.99s  4.38%  sync.(*Map).Load
     1.33s  1.17% 41.27%      1.33s  1.17%  reflect.(*rtype).Kind
     1.29s  1.13% 42.40%      1.29s  1.13%  cmpbody
     1.28s  1.12% 43.53%      1.28s  1.12%  runtime.resolveNameOff
     1.23s  1.08% 44.61%      1.30s  1.14%  runtime.heapBitsForAddr

MethodByName, which was in the 1st position, is now not included even in the top20.

But we have to dig deeper to draw any conclusions. Let’s count how many MethodByName calls there were and what is the distribution of the exportedMethods slice sizes and positions at which the method was found.

Total number of calls per site build: 17042238.

15920637 times | 120 methods
493669 times | 29 methods
227736 times | 16 methods
180098 times | 8 methods
143846 times | 39 methods
31371 times | 6 methods
14636 times | 1 methods
10933 times | 31 methods
10094 times | 9 methods
9026 times | 4 methods
102 times | 113 methods
85 times | 7 methods
4 times | 114 methods
1 times | 0 methods

Most of the calls were on a type that has 120 exported methods.

12998952 times | 98 index pos
1203864 times | 70 index pos
515978 times | 102 index pos
354191 times | 110 index pos
262965 times | 6 index pos
200819 times | 11 index pos
178457 times | 19 index pos
170806 times | 97 index pos
147510 times | 74 index pos
136927 times | 3 index pos
123071 times | 20 index pos
106250 times | 9 index pos
87190 times | 7 index pos
62767 times | 31 index pos
62192 times | 5 index pos
50170 times | 51 index pos
49120 times | 69 index pos
45325 times | 72 index pos
42236 times | 115 index pos
38646 times | 0 index pos
37106 times | 28 index pos
22744 times | 108 index pos
21521 times | 14 index pos
18134 times | 92 index pos
17341 times | 4 index pos
11535 times | 103 index pos
10791 times | 63 index pos
10303 times | 2 index pos
9170 times | 10 index pos
7278 times | 68 index pos
5936 times | 15 index pos
5466 times | 65 index pos
5385 times | 16 index pos
5172 times | 47 index pos
4478 times | 43 index pos
3032 times | 40 index pos
2340 times | 17 index pos
2265 times | 25 index pos
1516 times | 44 index pos
1068 times | 88 index pos
843 times | 1 index pos
709 times | 71 index pos
358 times | 109 index pos
100 times | 67 index pos
50 times | 64 index pos
40 times | 42 index pos
33 times | 34 index pos
17 times | 86 index pos
15 times | 83 index pos
15 times | 41 index pos
15 times | 33 index pos
12 times | 118 index pos
5 times | 8 index pos
5 times | 27 index pos
4 times | 66 index pos

The distribution of indexes of the searched methods also suggests that linear search in this case is not the best choice. Most of the methods were at position 98.

Underwater rocks

In the code above I have used sort.Searchbut it’s not available for package reflect.

The point is that the package sort imports reflect, so we get the cyclic imports error.

To avoid this, you can duplicate the implementation sort.Search in the package reflect or create a new internal package for Go, where there will be some subset of the package sort, independent of reflect.

This factor slightly complicates the implementation of the proposed patch, because the most beautiful implementation option is not available to us.

conclusions

I sent CL with changes in Go, but I don’t know if it will be accepted or not (spoiler: most likely not).

In any case, it was fun to see how such a seemingly innocuous operation slowed down the build of a static site for some significant time.

A little different I expected from hugo with the title “The world’s fastest framework for building websites.”

Similar Posts

Leave a Reply