from 1 minute 45 seconds to 4 seconds
A couple of weeks ago I read about the One Billion Row Challenge, and it really stuck with me, so I decided to solve it in Go.
I'm a little late: the competition was held in January, and in Java. I'm not particularly interested in Java, but I've long been interested in optimizing Go code.
The challenge is very simple: process a text file of weather station names and temperatures, and for each station output the minimum, mean, and maximum. There were also several restrictions to simplify the task, but I ignored the ones that are specific to Java.
Here are some lines with example input data:
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
...
The only catch is that the input file consists of a billion lines – roughly 13 GB of data. I had already figured out that disk I/O wouldn't be the bottleneck: programs like this are usually slowed down by memory allocation and parsing.
This article describes nine solutions I've written in Go, each faster than the last. The first, simple and idiomatic, takes 1 minute 45 seconds on my machine, and the last takes about 4 seconds. Along the way, I'll show you how I used the Go profiler to understand where my time was being spent.
Here is a list of solutions, from slowest to fastest:
r1: simple and idiomatic
r2: map with pointer values
r3: manual temperature parsing
r4: fixed point integers
r5: avoid bytes.Cut
r6: avoid bufio.Scanner
r7: special hash table
r8: parallelized r1
r9: parallelized r7
I wanted each solution to be portable Go code using only the standard library: no assembler, no unsafe, and no memory-mapped files. To me, 4 seconds, or 3.2 GB/s, seemed fast enough. For comparison, the fastest highly optimized Java solution runs in less than a second on my machine, so 4 seconds is not bad at all!
There are already many Go solutions out there, and at least one good article about them. My solution is faster than some, but a little slower than the fastest. However, I deliberately didn't study the others before writing mine – I wanted my solutions to be independent.
If you're only interested in the numbers, skip to the end of the article, where there's a table with the results.
A starting point
Here are some baseline measurements to understand what to aim for. First, how long does it take simply to read 13 GB of data with cat:
$ time cat measurements.txt >/dev/null
0m1.052s
Note that this is the best of five runs, so I'm letting the file be cached. I'm not sure whether Linux can keep the full 13 GB in disk cache, but presumably it can, because the first run took almost 6 seconds.
For comparison, actually doing something with the data is significantly slower: wc takes almost a minute:
$ time wc measurements.txt
1000000000 1179173106 13795293380 measurements.txt
0m55.710s
To get a baseline solution to the problem itself, I started with AWK. This solution uses Gawk because its asorti function makes it easy to sort the output. I used the -b option to enable "characters as bytes" mode, which speeds things up a bit:
$ time gawk -b -f 1brc.awk measurements.txt >measurements.out
7m35.567s
I'm confident I can beat the 7-minute mark even with a simple Go solution, so let's start there.
I'll start by optimizing the sequential single-core version (solutions 1-7) and then parallelize it (solutions 8 and 9). All results are from Go 1.21.5 on a linux/amd64 laptop with a fast SSD and 32GB RAM.
Many of my solutions, like most of the fastest solutions, assume the input is valid – for example, that temperatures have exactly one decimal digit. Many of my solutions will panic at runtime or produce incorrect results on invalid input.
Solution 1: Simple and idiomatic Go
I wanted the first version to be a simple, straightforward solution using only tools from the Go standard library: bufio.Scanner to read lines, strings.Cut to split on ';', strconv.ParseFloat to parse temperatures, and an ordinary Go map to accumulate the results.
First I will show the solution in its entirety, and then I will explain the most interesting parts:
func r1(inputPath string, output io.Writer) error {
    type stats struct {
        min, max, sum float64
        count         int64
    }

    f, err := os.Open(inputPath)
    if err != nil {
        return err
    }
    defer f.Close()

    stationStats := make(map[string]stats)

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        station, tempStr, hasSemi := strings.Cut(line, ";")
        if !hasSemi {
            continue
        }
        temp, err := strconv.ParseFloat(tempStr, 64)
        if err != nil {
            return err
        }
        s, ok := stationStats[station]
        if !ok {
            s.min = temp
            s.max = temp
            s.sum = temp
            s.count = 1
        } else {
            s.min = min(s.min, temp)
            s.max = max(s.max, temp)
            s.sum += temp
            s.count++
        }
        stationStats[station] = s
    }

    stations := make([]string, 0, len(stationStats))
    for station := range stationStats {
        stations = append(stations, station)
    }
    sort.Strings(stations)

    fmt.Fprint(output, "{")
    for i, station := range stations {
        if i > 0 {
            fmt.Fprint(output, ", ")
        }
        s := stationStats[station]
        mean := s.sum / float64(s.count)
        fmt.Fprintf(output, "%s=%.1f/%.1f/%.1f", station, s.min, mean, s.max)
    }
    fmt.Fprint(output, "}\n")
    return nil
}
This basic solution processes the billion rows in 1 minute 45 seconds – definitely better than the 7-minute AWK solution.
Solution 2: map with pointer values
I learned from writing my own count-words program that solution 1 does more hashing than necessary: for each line, we hash the station name twice – once when we look the value up in the map, and again when we update the map.
To avoid that, we can use a map[string]*stats (pointer values) and update the pointed-to struct, instead of a map[string]stats and updating the hash table itself.
First, though, I wanted to confirm this with the Go profiler. Adding CPU profiling to a Go program takes only a few lines of code:
$ ./go-1brc -cpuprofile=cpu.prof -revision=1 measurements-10000000.txt >measurements-10000000.out
Processed 131.6MB in 965.888929ms
$ go tool pprof -http=: cpu.prof
...
These commands produced the following profile of solution 1, run on a truncated 10-million-line input file:
Map operations take up a whole 30% of the time: 12.24% for assignment and 17.35% for lookup. By using pointer values, we should eliminate most of the map-assignment time.
Note: the profile also shows how the rest of the time is spent:
scanning lines with Scanner.Scan
finding ';' with strings.Cut
parsing temperatures with strconv.ParseFloat
calling Scanner.Text, which allocates a string for each line
So my second solution was just that small change to the map operations:
stationStats := make(map[string]*stats)
scanner := bufio.NewScanner(f)
for scanner.Scan() {
    // ...
    s := stationStats[station]
    if s == nil {
        stationStats[station] = &stats{
            min:   temp,
            max:   temp,
            sum:   temp,
            count: 1,
        }
    } else {
        s.min = min(s.min, temp)
        s.max = max(s.max, temp)
        s.sum += temp
        s.count++
    }
}
Now, when a station already exists in the map (the common case with a billion rows), we perform only one map operation, s := stationStats[station], so hashing the station name and probing the hash table happen only once; then we update the existing struct through the pointer.
It doesn't help a lot, but it's something: using pointer values in the map cuts the time from 1 minute 45 seconds to 1 minute 31 seconds.
Solution 3: Avoid strconv.ParseFloat
In the third solution things finally get a bit more hardcore: we'll parse the temperature with custom code instead of strconv.ParseFloat. The standard library function handles a ton of edge cases we don't need to support for this simple input format: the temperatures always have two or three digits, in the form 1.2 or 34.5 (some with a minus sign in front).
Besides, strconv.ParseFloat takes a string argument, and once we no longer call it, we can work directly with the byte slice from Scanner.Bytes instead of allocating and copying a string with Scanner.Text.
Now we parse the temperature like this:
negative := false
index := 0
if tempBytes[index] == '-' {
    index++
    negative = true
}
temp := float64(tempBytes[index] - '0') // parse first digit
index++
if tempBytes[index] != '.' {
    temp = temp*10 + float64(tempBytes[index]-'0') // parse optional second digit
    index++
}
index++                                    // skip '.'
temp += float64(tempBytes[index]-'0') / 10 // parse decimal digit
if negative {
    temp = -temp
}
It's not exactly beautiful, but it's not complicated either. And it cuts the time from 1 minute 31 seconds to under a minute: 55.8 seconds.
Solution 4: Fixed point integers
In the old days, floating-point instructions were much slower than integer ones. Today they're only slightly slower, but it's probably still worth avoiding them if we can.
In our problem, each temperature has a single decimal digit, so it's easy to use fixed-point integers: for example, we can represent 34.5 as the integer 345. Only at the very end, just before printing the results, do we convert back to float.
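To make the idea concrete, here is a minimal sketch of an integer version of the r3 parser, producing tenths of a degree directly. The parseTemp helper name is mine; solution 4 does this inline:

```go
package main

import "fmt"

// parseTemp parses a temperature like "-12.3" into tenths of a degree
// (here -123). It assumes the challenge's input format: an optional
// minus sign, one or two integer digits, a dot, and exactly one
// decimal digit.
func parseTemp(b []byte) int32 {
    negative := false
    i := 0
    if b[i] == '-' {
        negative = true
        i++
    }
    temp := int32(b[i] - '0') // first digit
    i++
    if b[i] != '.' { // optional second integer digit
        temp = temp*10 + int32(b[i]-'0')
        i++
    }
    i++                              // skip '.'
    temp = temp*10 + int32(b[i]-'0') // decimal digit
    if negative {
        temp = -temp
    }
    return temp
}

func main() {
    fmt.Println(parseTemp([]byte("34.5"))) // 345
    fmt.Println(parseTemp([]byte("-8.9"))) // -89
}
```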
So my fourth solution is essentially the same as solution 3, but with the stats struct fields changed to:
type stats struct {
    min, max, count int32
    sum             int64
}
Before printing the results, we have to divide by 10:
mean := float64(s.sum) / float64(s.count) / 10
fmt.Fprintf(output, "%s=%.1f/%.1f/%.1f",
    station, float64(s.min)/10, mean, float64(s.max)/10)
I used 32-bit integers for the min and max temperatures, since the maximum would be around 500 (50.0 degrees Celsius). I could have used int16, but from past experience I've found that modern 64-bit CPUs are slightly slower with 16-bit integers than with 32-bit ones. In my tests it made no measurable difference, but I chose 32-bit anyway.
Using integers reduced the time from 55.8 seconds to 51.0 seconds – a small gain.
Solution 5: Avoid bytes.Cut
To create solution 5, I recorded another profile (of solution 4):
Things have become less clear-cut. Map operations dominate, and moving to a custom hash table will be a bit involved, as will getting rid of bufio.Scanner. So let's procrastinate and get rid of bytes.Cut instead.
I figured this would be an easy way to save time. Take a look at this example line:
New Orleans;11.7
It's faster to parse the temperature from the end and find the ';' there than to scan through the whole station name looking for the ';'. This rather ugly code does exactly that:
end := len(line)
tenths := int32(line[end-1] - '0')
ones := int32(line[end-3] - '0') // line[end-2] is '.'
var temp int32
var semicolon int
if line[end-4] == ';' { // positive N.N temperature
    temp = ones*10 + tenths
    semicolon = end - 4
} else if line[end-4] == '-' { // negative -N.N temperature
    temp = -(ones*10 + tenths)
    semicolon = end - 5
} else {
    tens := int32(line[end-4] - '0')
    if line[end-5] == ';' { // positive NN.N temperature
        temp = tens*100 + ones*10 + tenths
        semicolon = end - 5
    } else { // negative -NN.N temperature
        temp = -(tens*100 + ones*10 + tenths)
        semicolon = end - 6
    }
}
station := line[:semicolon]
Dropping bytes.Cut took the time from 51.0 seconds to 46.0 seconds – another small win.
Solution 6: Avoid bufio.Scanner
Now let's get rid of bufio.Scanner. Think about it: to find the end of each line, the scanner has to look through all the bytes searching for a newline character. Then we process many of those bytes again to parse the temperature and find the ';'. So let's try to combine those steps and throw bufio.Scanner away.
In solution 6, we allocate a 1MB buffer, read the file in large blocks, find the last newline in each block to make sure we don't cut a line in half, and then process each block. It looks like this:
buf := make([]byte, 1024*1024)
readStart := 0
for {
    n, err := f.Read(buf[readStart:])
    if err != nil && err != io.EOF {
        return err
    }
    if readStart+n == 0 {
        break
    }
    chunk := buf[:readStart+n]

    newline := bytes.LastIndexByte(chunk, '\n')
    if newline < 0 {
        break
    }
    remaining := chunk[newline+1:]
    chunk = chunk[:newline+1]

    for {
        station, after, hasSemi := bytes.Cut(chunk, []byte(";"))
        // ... same temperature processing as in r4 ...
Getting rid of bufio.Scanner and doing our own scanning reduced the time from 46.0 seconds to 41.3 seconds. Another tiny win, but we'll take it.
Solution 7: Custom hash table
In solution 7 the gloves come off: we'll implement a custom hash table instead of using a Go map. This has two advantages:
we can hash the station name while scanning for the ';', avoiding a second pass over the bytes
we can store each key in the hash table as a byte slice, avoiding the conversion of each key to a string (which allocates and copies for every line of the file)
I've written about how to implement a hash table in C, and I also implemented my own "counting" hash table in Go, which is where I took this implementation from.
It's a simple implementation that uses the FNV-1a hashing algorithm with linear probing: if there's a collision, the next empty slot is used.
To keep it simple, I pre-allocate a large number of hash buckets (I used 100,000) to avoid writing table-resizing logic; the code panics if the table becomes more than half full. I measured that we get about 2% hash collisions.
This time there's a lot more code: setting up the hash table, the hashing itself, probing the table, and inserting:
// The hash table structure:
type item struct {
    key  []byte
    stat *stats
}
items := make([]item, 100000) // hash buckets, linearly probed
size := 0                     // number of active items in the items slice

buf := make([]byte, 1024*1024)
readStart := 0
for {
    // ... same chunking as in r6 ...

    for {
        const (
            // 64-bit FNV-1 constants from hash/fnv.
            offset64 = 14695981039346656037
            prime64  = 1099511628211
        )

        // Hash the station name while looking for ';'.
        var station, after []byte
        hash := uint64(offset64)
        i := 0
        for ; i < len(chunk); i++ {
            c := chunk[i]
            if c == ';' {
                station = chunk[:i]
                after = chunk[i+1:]
                break
            }
            hash ^= uint64(c) // FNV-1a is XOR then *
            hash *= prime64
        }
        if i == len(chunk) {
            break
        }

        // ... same temperature parsing as in r6 ...

        // Go to the right bucket in the hash table.
        hashIndex := int(hash & uint64(len(items)-1))
        for {
            if items[hashIndex].key == nil {
                // Found an empty slot, add a new item (copying the key).
                key := make([]byte, len(station))
                copy(key, station)
                items[hashIndex] = item{
                    key: key,
                    stat: &stats{
                        min:   temp,
                        max:   temp,
                        sum:   int64(temp),
                        count: 1,
                    },
                }
                size++
                if size > len(items)/2 {
                    panic("too many items in hash table")
                }
                break
            }
            if bytes.Equal(items[hashIndex].key, station) {
                // Found a matching slot, add to the existing stats.
                s := items[hashIndex].stat
                s.min = min(s.min, temp)
                s.max = max(s.max, temp)
                s.sum += int64(temp)
                s.count++
                break
            }
            // Slot already holds another key, try the next slot (linear probing).
            hashIndex++
            if hashIndex >= len(items) {
                hashIndex = 0
            }
        }
    }

    readStart = copy(buf, remaining)
}
The payoff from all this code is big: the custom hash table reduces the time from 41.3 to 25.8 seconds.
Solution 8: Parallel block processing
In solution 8, I wanted to add parallelism. But I decided to go back to the simple and idiomatic code from solution 1, with bufio.Scanner and strconv.ParseFloat, and parallelize that. That way we'll see which gives better results, optimization or parallelization – and in solution 9 we'll do both.
Parallelizing this map-reduce-style task is easy: split the file into similar-sized blocks (one per CPU core), fire up a thread (a goroutine in Go) to process each block, and combine the results at the end.
Here's what it looks like at a high level:
// Determine non-overlapping parts to split the file into (each part has an offset and a size).
parts, err := splitFile(inputPath, maxGoroutines)
if err != nil {
    return err
}

// Start a goroutine to process each part, returning results on a channel.
resultsCh := make(chan map[string]r8Stats)
for _, part := range parts {
    go r8ProcessPart(inputPath, part.offset, part.size, resultsCh)
}

// Wait for the results to come back and aggregate them.
totals := make(map[string]r8Stats)
for i := 0; i < len(parts); i++ {
    result := <-resultsCh
    for station, s := range result {
        ts, ok := totals[station]
        if !ok {
            totals[station] = r8Stats{
                min:   s.min,
                max:   s.max,
                sum:   s.sum,
                count: s.count,
            }
            continue
        }
        ts.min = min(ts.min, s.min)
        ts.max = max(ts.max, s.max)
        ts.sum += s.sum
        ts.count += s.count
        totals[station] = ts
    }
}
The splitFile function is fairly boring, so I haven't included it here. It looks at the size of the file, divides it into the number of parts we want, and then seeks into each one, reading 100 bytes before the end and finding the last newline, to make sure each part ends with a complete line.
The r8ProcessPart function is essentially the same as the r1 solution, but it starts by seeking to the part's offset and limits the read to the part's size (using io.LimitedReader). When it's done, it sends its own statistics map back on the channel:
func r8ProcessPart(inputPath string, fileOffset, fileSize int64,
    resultsCh chan map[string]r8Stats) {
    file, err := os.Open(inputPath)
    if err != nil {
        panic(err)
    }
    defer file.Close()
    _, err = file.Seek(fileOffset, io.SeekStart)
    if err != nil {
        panic(err)
    }
    f := io.LimitedReader{R: file, N: fileSize}

    stationStats := make(map[string]r8Stats)

    scanner := bufio.NewScanner(&f)
    for scanner.Scan() {
        // ... same processing as in r1 ...
    }

    resultsCh <- stationStats
}
Parallel processing of the input file gives a big win over r1, taking us from 1 minute 45 seconds down to 24.3 seconds. For comparison, the previous "optimized non-parallel" version, solution 7, took 25.8 seconds. So in this case parallelization is a bit faster than optimization – and a lot simpler.
Solution 9: All optimizations plus parallelization
In solution 9, our final attempt, we simply combine all the optimizations from r1 through r7 with the parallelization from r8.
I used the same splitFile function from r8, and the rest of the code is just copied from r7, so there's nothing new here – except the results: this final version dropped the time from 24.3 seconds to 3.99 seconds, a huge win.
Interestingly, since all the real processing now happens in one big function, r9ProcessPart, the profile graph is no longer particularly useful. Here's what it looks like now:
As you can see, 82% of the time is spent in r9ProcessPart, bytes.Equal takes 13%, and reading the file takes the remaining 5%.
To get more detailed profiling, we need to drill down past the function level shown by graph mode and use source mode. Here's the inner loop:
This report confused me: why does it show if items[hashIndex].key == nil taking 5.01s, while the call to bytes.Equal shows only 390ms? Indexing a slice is much cheaper than a function call, isn't it? If you're a Go performance geek and can help me interpret it, please reach out!
In any case, I'm sure there are crazier optimizations still to be found, but I decided to stop here. Processing a billion rows in 4 seconds, or 250 million rows per second, was quite enough for me.
Results table
Below is a table with all my Go solutions, plus the fastest Go and Java solutions. Each result is the best of five runs against the same billion-line input file.
Version | Description | Time | Speedup vs r1 |
---|---|---|---|
r1 | simple and idiomatic | 1 min 45 s | 1.00 |
r2 | map with pointer values | 1 min 31 s | 1.15 |
r3 | manual temperature parsing | 55.8 s | 1.87 |
r4 | fixed-point integers | 51.0 s | 2.05 |
r5 | avoiding bytes.Cut | 46.0 s | 2.27 |
r6 | avoiding bufio.Scanner | 41.3 s | 2.53 |
r7 | custom hash table | 25.8 s | 4.05 |
r8 | parallelized r1 | 24.3 s | 4.31 |
r9 | parallelized r7 | 3.99 s | 26.2 |
| fastest Go version | 2.90 s | 36.2 |
| fastest Java version | 0.953 s | 110 |
I'm in roughly the same range as the Go version by Alexander Yastrebov (AY). His solution is similar to mine: split the file into blocks, use a custom hash table (he even used FNV hashing too), and parse temperatures as integers. However, he uses memory-mapped files, which I'd ruled out for portability – that's probably why his version is a bit faster.
Thomas Wuerthinger, with help from other developers, created the fastest solution to the original challenge in Java. It runs in less than a second on my machine, four times faster than my Go version. It appears to use unrolled loops, branchless parsing code, and other low-level tricks, along with parallel processing and memory mapping.
Thomas is the founder and a lead contributor to GraalVM, a fast Java virtual machine with ahead-of-time compilation – so he's clearly an expert in this area. Great job, Thomas!
Final comments
Why is all this important?
For most day-to-day programming tasks, it's best to start with simple and idiomatic code. If you're computing statistics over a billion temperatures and only need the answer once, 1 minute 45 seconds is probably fine.
But if you're building a data processing pipeline and can make your code four or even 26 times faster, you'll not only make users happier, you can also seriously cut compute costs – if the system is well utilized, compute costs could be 4x or 26x lower than before!
And if you're building a runtime like GraalVM, or an interpreter like GoAWK, this level of performance really matters: if you speed up the interpreter, all your users' programs run much faster too.
Plus, it's just fun to write code that gets the most out of the machine.