from 1 minute 45 seconds to 4 seconds

A couple of weeks ago I read about the One Billion Row Challenge, and it really stuck with me, so I wanted to solve it in Go.

I'm a little late: the competition was held in January, and in Java. I'm not particularly interested in Java, but I've been interested in optimizing Go code for a long time.

The challenge is very simple: process a text file of weather station names and temperatures, and for each station output the minimum, mean, and maximum value. There were also several restrictions to keep things manageable, though I've ignored the ones that are specific to Java.

Here are some lines with example input data:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
...
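And here's the corresponding output format: a single line with the stations sorted alphabetically and min/mean/max for each. (This is an illustrative fragment I've constructed from the sample lines above, where each station appears once, so all three values coincide; it's not output from the real data.)

{Bulawayo=8.9/8.9/8.9, Cracow=12.6/12.6/12.6, Hamburg=12.0/12.0/12.0, ...}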

The only catch is that the input file contains a billion lines, roughly 13GB of data. I had already figured out that disk I/O wouldn't be the bottleneck: programs like this are usually slowed down by memory allocation and parsing.

This article describes nine solutions I've written in Go, each faster than the last. The first, simple and idiomatic, takes 1 minute 45 seconds on my machine, and the last takes about 4 seconds. Along the way, I'll show you how I used the Go profiler to understand where my time was being spent.

Here is a list of solutions, from slowest to fastest:

  • r1: simple and idiomatic

  • r2: map with pointer values

  • r3: manual temperature parsing

  • r4: fixed point integers

  • r5: avoid bytes.Cut

  • r6: avoid bufio.Scanner

  • r7: custom hash table

  • r8: parallelized r1

  • r9: parallelized r7

I wanted each solution to be portable Go code using only the standard library: no assembler, no unsafe and no memory-mapped files. For me, 4 seconds, or 3.2 GB/s, seemed like a fast enough result. By comparison, the fastest, highly optimized Java solution runs on my machine in less than a second – not bad!

There are already many existing solutions in Go, and at least one good write-up. My solution is faster than some of the others, but a little slower than the fastest. However, I deliberately didn't study any of them before writing mine: I wanted my choices to be independent.

If you're only interested in the numbers, skip to the end of the article, where there's a table of results.

A starting point

Here are some baseline measurements to give a sense of what we're aiming for. First, how long does it take just to read 13GB of data with cat:

$ time cat measurements.txt >/dev/null
0m1.052s

Note that this is the best of five runs, so I'm allowing the file to be cached. Who knows whether Linux can keep all 13GB in the disk cache; presumably it can, because the first run took almost 6 seconds.

For comparison, actually doing something with the data is significantly slower: wc takes almost a minute:

$ time wc measurements.txt 
 1000000000  1179173106 13795293380 measurements.txt
0m55.710s

For a simple baseline solution to the problem itself, AWK is a natural starting point. This solution uses Gawk, because its asorti function makes it easy to sort the output. I used the -b option for “characters as bytes” mode, which speeds things up a bit:

$ time gawk -b -f 1brc.awk measurements.txt >measurements.out
7m35.567s

I was confident I could beat the 7-minute mark even with a simple Go solution, so that's where I started.

I'll start by optimizing the sequential single-core version (solutions 1-7) and then parallelize it (solutions 8 and 9). All results are from Go 1.21.5 on a linux/amd64 laptop with a fast SSD and 32GB RAM.

Like most of the fastest solutions, many of mine assume the input is always valid, for example, that every temperature has exactly one decimal digit. Many of my solutions will panic at runtime or return incorrect results on invalid input.

Solution 1: Simple and idiomatic Go

I wanted the first version to be a simple, straightforward solution using only tools from the Go standard library: bufio.Scanner to read lines, strings.Cut to split on the ';', strconv.ParseFloat to parse the temperatures, and an ordinary Go map to accumulate the results.

First I will show the solution in its entirety, and then I will explain the most interesting parts:

func r1(inputPath string, output io.Writer) error {
    type stats struct {
        min, max, sum float64
        count         int64
    }

    f, err := os.Open(inputPath)
    if err != nil {
        return err
    }
    defer f.Close()

    stationStats := make(map[string]stats)

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        station, tempStr, hasSemi := strings.Cut(line, ";")
        if !hasSemi {
            continue
        }

        temp, err := strconv.ParseFloat(tempStr, 64)
        if err != nil {
            return err
        }

        s, ok := stationStats[station]
        if !ok {
            s.min = temp
            s.max = temp
            s.sum = temp
            s.count = 1
        } else {
            s.min = min(s.min, temp)
            s.max = max(s.max, temp)
            s.sum += temp
            s.count++
        }
        stationStats[station] = s
    }

    stations := make([]string, 0, len(stationStats))
    for station := range stationStats {
        stations = append(stations, station)
    }
    sort.Strings(stations)

    fmt.Fprint(output, "{")
    for i, station := range stations {
        if i > 0 {
            fmt.Fprint(output, ", ")
        }
        s := stationStats[station]
        mean := s.sum / float64(s.count)
        fmt.Fprintf(output, "%s=%.1f/%.1f/%.1f", station, s.min, mean, s.max)
    }
    fmt.Fprint(output, "}\n")
    return nil
}
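(Note: the min and max used here are Go's generic built-in functions, added in Go 1.21, not library calls.)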

This basic solution processes the billion rows in 1 minute 45 seconds. Definitely an improvement over the AWK solution's 7½ minutes.

Solution 2: map with pointer values

From writing my own count-words program, I knew that solution 1 does a little more hashing than necessary: for each line, it hashes the station string twice, first when looking the value up in the map, and again when updating the map.

To avoid that, we can use a map[string]*stats (pointer values) and update the struct the pointer refers to, instead of using a map[string]stats and writing back into the hash table itself.

However, first I wanted to verify this with the Go profiler. Adding CPU profiling to a Go program takes only a few lines of code:
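Something like this minimal sketch, assuming a -cpuprofile command-line flag (the flag and variable names here are my own; the real program's wiring may differ):

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

var cpuProfile = flag.String("cpuprofile", "", "write CPU profile to file")

func main() {
    flag.Parse()
    if *cpuProfile != "" {
        f, err := os.Create(*cpuProfile)
        if err != nil {
            log.Fatal(err)
        }
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }
    // ... run the selected solution ...
}

With that in place, the following commands record a profile and open it in the pprof web UI: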

$ ./go-1brc -cpuprofile=cpu.prof -revision=1 measurements-10000000.txt >measurements-10000000.out
Processed 131.6MB in 965.888929ms
$ go tool pprof -http=: cpu.prof
...

These commands produced the following profile of solution 1, run on a trimmed-down, 10-million-line input file:

Profile of solution r1

Map operations take up as much as 30% of the time: 12.24% for assignment and 17.35% for lookup. By using a pointer value, we should eliminate the bulk of the map assignment time.

The profile also shows how the rest of the time is spent:

  • Scanning lines with Scanner.Scan

  • Finding the ';' with strings.Cut

  • Parsing temperatures with strconv.ParseFloat

  • Calling Scanner.Text, which allocates a string for every line of the file

In any case, my second solution makes just this one improvement to the map operations:

stationStats := make(map[string]*stats)
scanner := bufio.NewScanner(f)
for scanner.Scan() {
    // ...
    s := stationStats[station]
    if s == nil {
        stationStats[station] = &stats{
            min:   temp,
            max:   temp,
            sum:   temp,
            count: 1,
        }
    } else {
        s.min = min(s.min, temp)
        s.max = max(s.max, temp)
        s.sum += temp
        s.count++
    }
}

Now, when a station already exists in the map, we perform only one map operation, s := stationStats[station], so the station name only has to be hashed and looked up once. If it's already in the map – the common case across a billion rows – we update the existing struct through the pointer.

It's not a huge improvement, but it is something: using pointer values in the map cuts the time from 1 minute 45 seconds to 1 minute 31 seconds.

Solution 3: Avoid strconv.ParseFloat

In the third solution things finally get a bit more hardcore: we parse the temperature with custom code instead of strconv.ParseFloat. The standard library function handles a ton of edge cases that this simple input format doesn't need: our temperatures always have two or three digits in the form 1.2 or 34.5 (some with a minus sign in front).

Also, strconv.ParseFloat takes a string argument; now that we're no longer calling it, we can work with the byte slice returned directly by Scanner.Bytes instead of allocating and copying a string with Scanner.Text.
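The change to the loop itself is small; a sketch, with bytes.Cut standing in for strings.Cut (note that the string(station) conversion inside a map index expression doesn't allocate, because the Go compiler optimizes that pattern):

line := scanner.Bytes()
station, tempBytes, hasSemi := bytes.Cut(line, []byte(";"))
if !hasSemi {
    continue
}
// ... parse tempBytes as shown below ...
s := stationStats[string(station)]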

Now we parse the temperature like this:

negative := false
index := 0
if tempBytes[index] == '-' {
    index++
    negative = true
}
temp := float64(tempBytes[index] - '0') // parse first digit
index++
if tempBytes[index] != '.' {
    temp = temp*10 + float64(tempBytes[index]-'0') // parse optional second digit
    index++
}
index++ // skip '.'
temp += float64(tempBytes[index]-'0') / 10 // parse decimal digit
if negative {
    temp = -temp
}

It's not particularly pretty, but it's not complicated either. And it cuts the time from 1 minute 31 seconds to under a minute: 55.8 seconds.

Solution 4: Fixed point integers

In the old days, floating point instructions were much slower than integer ones. Today they're only slightly slower, but it's probably still worth avoiding them if we can.

In our problem, each temperature has exactly one decimal digit, so it's easy to use fixed-point integers: for example, we can represent 34.5 as the integer 345, and only convert back to float at the very end, just before printing the results.
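Concretely, the temperature parser from solution 3 becomes something like this (a sketch: the r3 code above with int32 in place of float64, building up the value in tenths):

negative := false
index := 0
if tempBytes[index] == '-' {
    index++
    negative = true
}
temp := int32(tempBytes[index] - '0') // parse first digit
index++
if tempBytes[index] != '.' {
    temp = temp*10 + int32(tempBytes[index]-'0') // parse optional second digit
    index++
}
index++ // skip '.'
temp = temp*10 + int32(tempBytes[index]-'0') // parse tenths digit
if negative {
    temp = -temp
}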

So my fourth solution is essentially the same as solution 3, but with the stats struct changed to use integer fields:

type stats struct {
    min, max, count int32
    sum             int64
}

When printing the results, the values have to be divided back by 10:

mean := float64(s.sum) / float64(s.count) / 10
fmt.Fprintf(output, "%s=%.1f/%.1f/%.1f",
    station, float64(s.min)/10, mean, float64(s.max)/10)

For the minimum and maximum temperatures I used 32-bit integers, since the maximum would be around 500 (50.0 degrees Celsius). I could have used int16, but past experience has taught me that modern 64-bit CPUs are slightly slower with 16-bit integers than with 32-bit ones. In my tests it made no measurable difference, but I chose 32-bit anyway.

Switching to integers reduced the time from 55.8 seconds to 51.0 seconds: a small gain.

Solution 5: Avoid bytes.Cut

To work out solution 5, I recorded another profile, this time of solution 4:

Profile of solution r4

Now things get harder. Map operations dominate the profile, but switching to a custom hash table and dropping bufio.Scanner are both going to be fiddly, so I put those off and procrastinated by getting rid of bytes.Cut first.

I figured this was an easy way to save some time. Look at this example line:

New Orleans;11.7

It's faster to parse the temperature from the end of the line and find the ';' there than to scan through the whole station name looking for the ';'. This rather ugly code does exactly that:

end := len(line)
tenths := int32(line[end-1] - '0')
ones := int32(line[end-3] - '0') // line[end-2] is '.'
var temp int32
var semicolon int
if line[end-4] == ';' {          // positive temperature N.N
    temp = ones*10 + tenths
    semicolon = end - 4
} else if line[end-4] == '-' {   // negative temperature -N.N
    temp = -(ones*10 + tenths)
    semicolon = end - 5
} else {
    tens := int32(line[end-4] - '0')
    if line[end-5] == ';' {      // positive temperature NN.N
        temp = tens*100 + ones*10 + tenths
        semicolon = end - 5
    } else {                     // negative temperature -NN.N
        temp = -(tens*100 + ones*10 + tenths)
        semicolon = end - 6
    }
}
station := line[:semicolon]

Dropping bytes.Cut took the time from 51.0 seconds to 46.0 seconds: another small win.

Solution 6: Avoid bufio.Scanner

Now let's get rid of bufio.Scanner. Think about it: to find the end of each line, the scanner has to look at every byte searching for the newline character. Then we process many of those bytes again to parse the temperature and find the ';'. So let's try to combine those steps and throw out bufio.Scanner.

In solution 6, we allocate a 1MB buffer, read the file in large chunks, find the last newline in each chunk to make sure we don't cut a line in half, and then process each chunk. It looks like this:

buf := make([]byte, 1024*1024)
readStart := 0
for {
    n, err := f.Read(buf[readStart:])
    if err != nil && err != io.EOF {
        return err
    }
    if readStart+n == 0 {
        break
    }
    chunk := buf[:readStart+n]

    newline := bytes.LastIndexByte(chunk, '\n')
    if newline < 0 {
        break
    }
    remaining := chunk[newline+1:]
    chunk = chunk[:newline+1]

    for {
        station, after, hasSemi := bytes.Cut(chunk, []byte(";"))
        // ... same temperature processing as in r4 ...

Eliminating bufio.Scanner and doing our own scanning cut the time from 46.0 seconds to 41.3 seconds. Another small win, but we'll take it.

Solution 7: Custom hash table

In solution 7, the gloves come off: we implement a custom hash table instead of using Go's map. This has two advantages:

  1. We can hash the station name as we scan for the ';', avoiding a second pass over the bytes.

  2. We can store each key in the hash table as a byte slice, avoiding the need to convert every key to a string (which would allocate and copy for every line of the file).

I've written about how to implement a hash table in C, and I also implemented a custom “counting” hash table in Go; that's where this implementation comes from.

It's a simple implementation that uses the FNV-1a hash algorithm with linear probing: if there's a collision, the next empty slot is used.

To keep it simple, I pre-allocate a large number of hash buckets (I used 131,072, a power of two so the bucket index can be computed by masking the hash) to avoid writing table-resizing logic; the code panics if the table becomes more than half full. I measured that we get about 2% hash collisions.

This time there is much more code – preparing the hash table, hashing itself, probing the table and inserting:

// The hash table structure:
type item struct {
    key  []byte
    stat *stats
}
items := make([]item, 1<<17) // hash buckets, linearly probed (a power of two, so we can mask the hash)
size := 0                    // number of active items in the items slice

buf := make([]byte, 1024*1024)
readStart := 0
for {
    // ... same splitting into chunks as in r6 ...

    for {
        const (
            // 64-bit FNV-1 constants from hash/fnv.
            offset64 = 14695981039346656037
            prime64  = 1099511628211
        )

        // Hash the station name and look for the ';'.
        var station, after []byte
        hash := uint64(offset64)
        i := 0
        for ; i < len(chunk); i++ {
            c := chunk[i]
            if c == ';' {
                station = chunk[:i]
                after = chunk[i+1:]
                break
            }
            hash ^= uint64(c) // FNV-1a is XOR then *
            hash *= prime64
        }
        if i == len(chunk) {
            break
        }

        // ... same temperature parsing as in r6 ...

        // Go to the right bucket in the hash table.
        hashIndex := int(hash & uint64(len(items)-1))
        for {
            if items[hashIndex].key == nil {
                // Found an empty slot: add a new item (copying the key).
                key := make([]byte, len(station))
                copy(key, station)
                items[hashIndex] = item{
                    key: key,
                    stat: &stats{
                        min:   temp,
                        max:   temp,
                        sum:   int64(temp),
                        count: 1,
                    },
                }
                size++
                if size > len(items)/2 {
                    panic("too many items in hash table")
                }
                break
            }
            if bytes.Equal(items[hashIndex].key, station) {
                // Found a matching slot: add to the existing stats.
                s := items[hashIndex].stat
                s.min = min(s.min, temp)
                s.max = max(s.max, temp)
                s.sum += int64(temp)
                s.count++
                break
            }
            // The slot holds a different key: try the next slot (linear probing).
            hashIndex++
            if hashIndex >= len(items) {
                hashIndex = 0
            }
        }
    }

    readStart = copy(buf, remaining)
}

The payoff for all this code is big: the custom hash table reduces the time from 41.3 to 25.8 seconds.

Solution 8: Parallel block processing

In solution 8, I wanted to add parallelism. But I decided to step back to the simple and idiomatic code from solution 1, with bufio.Scanner and strconv.ParseFloat, and parallelize that. That way we can see which helps more, optimization or parallelization; then in solution 9 we'll do both.

This map-reduce style task is very easy to parallelize: split the file into similar-sized chunks (one per CPU core), start a thread (a goroutine, in Go) to process each chunk, and merge the results at the end.

Here's what it looks like at a high level:

// Determine non-overlapping parts to split the file into (each part has an offset and a size).
parts, err := splitFile(inputPath, maxGoroutines)
if err != nil {
    return err
}

// Start a goroutine to process each part, returning results on a channel.
resultsCh := make(chan map[string]r8Stats)
for _, part := range parts {
    go r8ProcessPart(inputPath, part.offset, part.size, resultsCh)
}

// Wait for the results to come back and aggregate them.
totals := make(map[string]r8Stats)
for i := 0; i < len(parts); i++ {
    result := <-resultsCh
    for station, s := range result {
        ts, ok := totals[station]
        if !ok {
            totals[station] = r8Stats{
                min:   s.min,
                max:   s.max,
                sum:   s.sum,
                count: s.count,
            }
            continue
        }
        ts.min = min(ts.min, s.min)
        ts.max = max(ts.max, s.max)
        ts.sum += s.sum
        ts.count += s.count
        totals[station] = ts
    }
}

The splitFile function is fairly boring, so I haven't included it here in full. It looks at the size of the file, divides that by the number of parts we want, and then seeks around each boundary, reading the 100 bytes before it and finding the last newline, to ensure each part ends with a complete line.
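For the curious, here is a minimal sketch of how such a function could look; the part struct, names, and error handling are my assumptions, not the exact code, and the usual bytes, fmt, io, and os imports are assumed:

type part struct {
    offset, size int64
}

func splitFile(inputPath string, numParts int) ([]part, error) {
    const maxLineLength = 100 // assume no line is longer than this

    f, err := os.Open(inputPath)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    st, err := f.Stat()
    if err != nil {
        return nil, err
    }
    size := st.Size()
    splitSize := size / int64(numParts)

    buf := make([]byte, maxLineLength)
    parts := make([]part, 0, numParts)
    offset := int64(0)
    for i := 0; i < numParts; i++ {
        if i == numParts-1 { // last part runs to the end of the file
            if offset < size {
                parts = append(parts, part{offset, size - offset})
            }
            break
        }
        // Read the maxLineLength bytes just before the rough boundary
        // and cut at the last newline within them.
        seekOffset := offset + splitSize - maxLineLength
        if seekOffset < 0 {
            seekOffset = 0
        }
        if _, err := f.Seek(seekOffset, io.SeekStart); err != nil {
            return nil, err
        }
        n, _ := io.ReadFull(f, buf) // at EOF, n may be short; that's fine
        newline := bytes.LastIndexByte(buf[:n], '\n')
        if newline < 0 {
            return nil, fmt.Errorf("newline not found near offset %d", seekOffset)
        }
        nextOffset := seekOffset + int64(newline) + 1
        parts = append(parts, part{offset, nextOffset - offset})
        offset = nextOffset
    }
    return parts, nil
}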

The r8ProcessPart function is essentially the same as the r1 solution, but it starts by seeking to the part's offset and limits reading to the part's size (using an io.LimitedReader). When it's done, it sends its stats map back on the channel:

func r8ProcessPart(inputPath string, fileOffset, fileSize int64,
                   resultsCh chan map[string]r8Stats) {
    file, err := os.Open(inputPath)
    if err != nil {
        panic(err)
    }
    defer file.Close()
    _, err = file.Seek(fileOffset, io.SeekStart)
    if err != nil {
        panic(err)
    }
    f := io.LimitedReader{R: file, N: fileSize}

    stationStats := make(map[string]r8Stats)

    scanner := bufio.NewScanner(&f)
    for scanner.Scan() {
        // ... same processing as in r1 ...
    }

    resultsCh <- stationStats
}

Processing the input file in parallel gives a big win over r1, taking the time from 1 minute 45 seconds down to 24.3 seconds. For comparison, the previous “optimized but non-parallel” version, solution 7, took 25.8 seconds. So in this case parallelization is slightly faster than optimization, and also quite a bit simpler.

Solution 9: All optimizations plus parallelization

In solution 9, our final attempt, we simply combine all the optimizations from r1 through r7 with the parallelization from r8.

I used the same splitFile function from r8, and the rest of the code was copied straight from r7, so there's nothing new here – except the results: this final version took the time from 24.3 seconds down to 3.99 seconds, a huge win.

Interestingly, since all the real processing now happens in one big function, r9ProcessPart, the profile graph is no longer very useful. Here's what it looks like now:

Profile of solution r9

As you can see, 82% of the time is spent in r9ProcessPart, bytes.Equal takes 13%, and reading the file takes the remaining 5%.

To profile in more detail, we need to drop below the function level that the graph view shows and use the source view. Here's the inner loop:

Profile of solution r9 – source view
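As an aside, you can also get an annotated source listing like this straight from the command line with pprof's -list flag and a function-name regex, for example (the binary and function names here are from my code):

$ go tool pprof -list 'r9ProcessPart' ./go-1brc cpu.prof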

This report confused me: why does it show if items[hashIndex].key == nil taking 5.01s, while the call to bytes.Equal is shown as only 390ms? Isn't a slice lookup much cheaper than a function call? If you're a Go performance geek and can help me interpret this, I'd love to hear from you!

In any case, I'm sure there are crazier optimizations to be had, but I decided to stop here. For me, processing a billion rows in 4 seconds, or 250 million rows per second, was quite enough.

Results table

Below is a table of all my Go solutions, plus the fastest Go and Java solutions. Each result is the best of five runs against the same billion-line input file.

Version  Description                    Time      Time relative to r1
r1       simple and idiomatic           1 min 45  1.00
r2       map with pointer values        1 min 31  1.15
r3       parsing temperatures manually  55.8 s    1.87
r4       fixed point integers           51.0 s    2.05
r5       getting rid of bytes.Cut       46.0 s    2.27
r6       getting rid of bufio.Scanner   41.3 s    2.53
r7       custom hash table              25.8 s    4.05
r8       parallelized r1                24.3 s    4.31
r9       parallelized r7                3.99 s    26.2
AY       fastest Go version             2.90 s    36.2
TW       fastest Java version           0.953 s   110

I'm roughly in the same range as the Go version by Alexander Yastrebov (AY). His solution is similar to mine: break the file into chunks, use a custom hash table (he even used FNV hashing too), and parse temperatures as integers. However, he uses memory-mapped files, which I had ruled out for portability; that's probably why his version is a bit faster.

Thomas Wuerthinger (with help from other developers) created the fastest solution in the original Java challenge, TW. It runs in under a second on my machine, four times as fast as my Go version. It appears to use unrolled loops, branchless parsing code, and other low-level tricks, along with parallel processing and memory-mapped files.

Thomas is apparently the founder of and a lead contributor to GraalVM (a fast Java Virtual Machine with ahead-of-time compilation), so he's clearly an expert in this field. Great job, Thomas!

Final comments

Why is all this important?

For most day-to-day programming tasks, it's best to start with simple and idiomatic code. If you're calculating statistics over a billion temperatures and only need the answer once, 1 minute 45 seconds is probably fine.

But if you're building a data processing pipeline and can make your code 4 or even 26 times faster, you won't just delight your users – you can also seriously cut compute costs: if the system is well utilized, compute costs could be 4 or 26 times lower than before!

And if you're building a runtime like GraalVM, or an interpreter like GoAWK, this level of performance really matters: if you speed up the interpreter, your users' programs run much faster too.

Plus, it's just fun to write code that gets the most out of the machine.
