Three Ways to Optimize Memory Management in Go with Memory Pools

What's the problem?

I came to Go development after 20 years of programming in C++ and Assembler, where my work was closely tied to multimedia. In this field, data must be processed quickly: imagine how much computation it takes to play a 4K video, decoding and displaying 60 frames per second.

Since I had practical experience with how such optimization works at a low level in C/C++ programs, I was asked to optimize a Go program.

However, in the process it became clear that optimization in Go is very different from optimization in C++. In C++, I could rely on the operating system for memory management and hardware access. In Go, things do not quite work that way.

The peculiarity of the application I had to optimize was that 90% of the time it needed few system resources, but once every 10 minutes it put a peak load on the memory subsystem. To process the megabytes arriving over the network, it had to allocate many data structures in RAM. At that moment, something happened that most programmers are not prepared for.

The Garbage Collector is known to trigger in the following scenarios (the sketch after the list shows how each trigger can be controlled from code):

  • Regularly – for example, once every two minutes – the GC “picks up” memory that is not in use.

  • Off schedule, when memory consumption reaches the maximum limit.

  • At the request of the developer.
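
For reference, all three triggers have counterparts in the standard library. A minimal sketch using runtime and runtime/debug; the 2 GiB limit is just an illustrative value:

package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	// Heap-growth pacing: a collection is triggered when the heap grows
	// by this percentage over the live set (the GOGC environment variable).
	debug.SetGCPercent(100)

	// Hard limit: as total memory approaches it, the runtime forces
	// collections (the GOMEMLIMIT environment variable, Go 1.19+).
	debug.SetMemoryLimit(2 << 30) // 2 GiB, an illustrative value

	// On demand: runtime.GC blocks until a full collection completes.
	runtime.GC()
}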

You can see peak loads on the graph

In my case, when a lot of data arrived over the network for processing, the GC registered the peak memory consumption, stopped the program, freed the memory, and continued executing the program as if nothing had happened.

I didn't yet understand what was going on, so the first thing I did was run the profiler; fortunately, Go ships great tooling along with the compiler. That's how I learned that it wasn't some slow time-critical function consuming all the CPU and other resources: the profiler showed that the entire program was stopping whenever the GC kicked in.
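
For completeness, here is one common way to take such a profile. A minimal sketch with net/http/pprof; the port is arbitrary, and this is a generic setup rather than the exact one from my investigation:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers
)

func main() {
	// Expose the profiler, then for example:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application ...
	select {}
}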

Something had to be done about this. It was useless to speed up the calculations by rewriting parts of the code with SIMD instructions and taking load off the CPU: the main bottleneck was how the Go runtime reclaims unused memory. I also didn't want to raise the GC threshold, trigger unscheduled collections, or force the GC to run more or less often than every two minutes, as that could harm the program's behavior even more.

A hypothesis arose that the problem could be solved by reducing the amount of memory consumed during peak loads. Besides the obvious steps of trimming fields from the stored structures and moving common repeated information into separate variables, I found several other ways to optimize memory usage.

This article is based on a talk from a Go meetup. On September 25, YADRO is holding another meeting of Go developers in St. Petersburg. Engineers will explain how to choose a test framework, simplify the work of a platform team, and debug a service in production. You can join offline or online; participation is free, but you need to register.

How Go Works with Memory and Why Optimization Is Still Necessary

Before we move on to the first method, let's look at how Go and the Garbage Collector manage memory. Take a look at the primitive functions GetBytes and PutBytes: we'll use GetBytes to get a byte slice from the runtime, and PutBytes to return it.

type NoPool struct{}

func (p *NoPool) GetBytes() *[]byte {
	b := make([]byte, 0, ContentCap)
	return &b
}

func (p *NoPool) PutBytes(b *[]byte) {
	// just do nothing
}

When memory is needed in a Go program, developers usually introduce a variable and then "forget" about it, trusting the GC to come and clean everything up. Everyone lives with this and is happy, until an unusual situation arises, such as the peak loads in my case.

In this code, GetBytes actually initializes a byte slice and returns a pointer to it to the caller, while PutBytes only pretends to return something somewhere. The approach relies entirely on the GC, which will run, determine that the allocated slice is no longer referenced, and return the slice's memory to the pool that is free for further allocation.

Below we see the normal behavior of a Go program, without optimizations.

How memory is used by default between GC runs

The problem is that between GC runs, so much memory may be needed that consumption approaches the limit specified by GOMEMLIMIT.

We see the application taking up memory while the GC has not yet come to free it. Memory that will be freed during the next GC run is not yet available for reuse: even though it is no longer referenced, we cannot touch it until the GC runs, on schedule or on reaching GOMEMLIMIT, and returns the freed regions to the list of memory available for use.
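
You can watch this growth between collections with the runtime's own counters. A minimal sketch using runtime.ReadMemStats; the one-second interval is arbitrary:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var m runtime.MemStats
	for range time.Tick(time.Second) {
		runtime.ReadMemStats(&m)
		// HeapAlloc grows between collections; NextGC is the heap size
		// at which the next collection is planned; NumGC counts runs.
		fmt.Printf("heap=%d MiB next_gc=%d MiB gc_runs=%d\n",
			m.HeapAlloc>>20, m.NextGC>>20, m.NumGC)
	}
}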

This is fine for undemanding programs, but if you need to actively work with memory, the method without optimization is not suitable. The only advantage of this solution is that the developer does not need to worry about memory, because the management is entirely up to the GC.

As a rule, optimizing memory usage comes down to allocating a certain volume up front and then distributing and freeing fragments for variables by your own means. This takes the load off the GC, which then does not trigger on reaching the limit specified by GOMEMLIMIT. Let's look at some ways to do this.

Method 1: Start a Channel Pool

The first method, which I found on the Internet and tested successfully, is to use a buffered channel to store allocated memory fragments. Let's call this way of organizing a memory pool the Channel Pool. In this implementation, GetBytes takes a free buffer from the channel if one is available (and allocates a fresh one otherwise), and PutBytes returns a buffer that is no longer in use to the channel:

type ChanPool struct{}

var chanPool = make(chan *[]byte, BuffCount)

func (p *ChanPool) GetBytes() *[]byte {
	select {
	case b, ok := <-chanPool:
		if ok {
			return b
		}
	default:
	}
	b := make([]byte, 0, ContentCap)
	return &b
}

func (p *ChanPool) PutBytes(b *[]byte) {
	if cap(*b) > ContentCap {
		return
	}
	*b = (*b)[:0]
	select {
	case chanPool <- b:
	default:
		// the pool is full; drop the buffer and let the GC reclaim it
	}
}
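
Usage is symmetrical: take a buffer, fill it, hand it back. A minimal sketch, where handlePacket and process are hypothetical names:

func handlePacket(pool *ChanPool, payload []byte) {
	b := pool.GetBytes()
	defer pool.PutBytes(b) // return the buffer to the pool when done

	*b = append(*b, payload...) // stays within ContentCap if payload fits
	process(*b)                 // process is whatever consumes the bytes
}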

Since a buffered channel's capacity cannot be changed after initialization, you have to try to predict how many memory buffers will be enough for the program. If the guess made when creating the channel is wrong, sooner or later the program will starve, and the pool built on channels will become the weak point of the algorithm.

How does a pool organized on channels work?

The method works well, but it has its drawbacks.

It is not always possible to guess in advance how many buffers will be needed. One can assume that 10,000 will be enough, but sooner or later they won't be, and the request for the 10,001st buffer will either fall back to a fresh allocation, as in the code above, or, in a strictly blocking implementation, stall the algorithm until a free buffer is returned to the channel.

Given the above, I continued searching for a more universal method that would not force developers to guess the size of the buffered channel.

Method 2: Store memory in sync.Pool

I kept searching and found the Pool struct in the sync package. It's a list that grows dynamically and can store any entity, including memory allocated on the heap.

If we arrange for the New function of a sync.Pool variable to return a slice or another large structure, we can use that memory and later return it to the Pool. If the Pool has no free items at the next request, New will be called and allocate a new piece. In this respect, a memory pool organized on sync.Pool has no limitations.

type SyncPool struct{}

var heapBuffersPool = &sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, ContentCap)
		return &b
	},
}

func (p *SyncPool) GetBytes() *[]byte {
	return heapBuffersPool.Get().(*[]byte)
}

func (p *SyncPool) PutBytes(b *[]byte) {
	if cap(*b) > ContentCap {
		return
	}
	*b = (*b)[:0]
	heapBuffersPool.Put(b)
}

This effectively overcomes the problem we ran into with the Channel Pool implementation.

However, do not forget: if memory returned to sync.Pool is not in use when the GC comes on schedule, the GC will take that memory away. You should not expect a piece of memory placed in sync.Pool to always be there.

So you shouldn't store data there. Some developers fall into this trap: they think, it's a list, so I'll put memory in it now, take it back sooner or later, and everything will be fine. No, it won't work that way. You can't use sync.Pool as a cache!
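
The trap is easy to demonstrate. A minimal sketch; the string values are purely illustrative, and since Go 1.13 pooled items survive one collection (a victim cache), hence the two runtime.GC calls:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	pool := &sync.Pool{New: func() interface{} { return "fresh" }}

	pool.Put("precious data")
	runtime.GC()
	runtime.GC()

	fmt.Println(pool.Get()) // likely prints "fresh": the stored value is gone
}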

In the animation you can see the New function, which allocates memory. It is called every time sync.Pool doesn't have enough elements for your request. As we can see, we won in one place but ran into trouble in another: the GC still "pays attention" to this memory.

Method 3: Create a memory arena

After getting acquainted with sync.Pool, we migrated the application to it. It worked better: we no longer ran into the memory limit. But there is no limit to perfection: Go keeps evolving, and the language has an experimental feature, the memory arena.

A memory arena is a piece of memory that the GC does not pay attention to at all. When we switched to sync.Pool backed by heap memory, the remaining problem was that the GC would come and clean something up. With sync.Pool backed by a memory arena, we solved this problem, but there are nuances.

The thing is, if you misjudged the size of the buffer you requested from the memory arena, you can't simply grow it later. You need to guess the buffer size right away, so that any future need for more memory fits within it.

// Requires building with GOEXPERIMENT=arenas (Go 1.20+) and importing "arena".
type ArenaPool struct{}

// The arena lives for the whole program here; arena memory is reclaimed
// only when the arena itself is freed with a.Free(), not by the GC.
var a = arena.NewArena()

var arenaBuffersPool = &sync.Pool{
	New: func() interface{} {
		b := arena.MakeSlice[byte](a, 0, ContentCap)
		return &b
	},
}

func (p *ArenaPool) GetBytes() *[]byte {
	return arenaBuffersPool.Get().(*[]byte)
}

func (p *ArenaPool) PutBytes(b *[]byte) {
	if cap(*b) > ContentCap {
		return
	}
	*b = (*b)[:0]
	arenaBuffersPool.Put(b)
}

See how simple it is. There is essentially no difference between GetBytes and PutBytes for sync.Pool on the heap and sync.Pool on a memory arena. The only difference is how we implement the New function.

Benchmarks for all methods

For the benchmarks, I requested memory allocations and, to prevent the compiler from optimizing the work away, randomly modified each piece of memory, then either freed it (in the case without optimization) or returned it to the memory pool.
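
The harness looked roughly like this. A minimal sketch with the standard testing package; BytePool is a hypothetical interface over the implementations above, assumed to live in the same package:

// pools_test.go
package main

import "testing"

// BytePool is a hypothetical common interface over the four implementations.
type BytePool interface {
	GetBytes() *[]byte
	PutBytes(*[]byte)
}

func benchmarkPool(b *testing.B, p BytePool) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := p.GetBytes()
		// Touch the memory so the compiler cannot optimize the work away.
		*buf = append(*buf, byte(i))
		p.PutBytes(buf)
	}
}

func BenchmarkNoPool(b *testing.B)    { benchmarkPool(b, &NoPool{}) }
func BenchmarkChanPool(b *testing.B)  { benchmarkPool(b, &ChanPool{}) }
func BenchmarkSyncPool(b *testing.B)  { benchmarkPool(b, &SyncPool{}) }
func BenchmarkArenaPool(b *testing.B) { benchmarkPool(b, &ArenaPool{}) }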

We see that normal memory handling, without any memory-pool experiments, is an order of magnitude slower than the other methods. The memory pool organized on channels is slightly slower than the remaining two. And sync.Pool on the heap and sync.Pool on a memory arena showed approximately the same results: the screenshot shows the memory arena being slightly faster, but I would attribute that to measurement error.

If you look at the flame graph below, you'll see that the method without optimization eats up a huge amount of memory and calls system functions many times. Other methods — sync.Pool on heap, sync.Pool on memory arena, and channel buffers — show similar results in terms of function calls and memory allocation.

Flame graph

How to return memory to the pool

If you store memory in a pool as slices and resize them during use so as not to run past their bounds, the slices will end up different sizes and will tend toward the maximum. Over time, instead of 100 buffers of 10 kilobytes each, you may find yourself with a pool of 100 buffers of 10 gigabytes each.

This applies to all the memory pools considered, no matter what they are built on. Always keep it in mind, so that optimizing memory use does not involuntarily turn into greedy use of it.

It is important to control what memory, and in what volume, we return to any memory pool. If we unconditionally return an overgrown piece of memory to the pool, sooner or later memory consumption will exceed the expected size.

What to do? Change PutBytes a little. Empirically derive, for your project, the maximum size that is allowed back into the pool. Then, comparing the size of the returned memory against this constant, simply forget about buffers that are too large when they try to return.

This is exactly why a check on the size of the returned buffer was added to PutBytes in every implementation.

if cap(*b) > ContentCap {
	return
}

A piece of memory that is not returned to the pool will sooner or later be released by the GC for general use. Nothing bad will happen. The important thing is that such buffers will not end up in the memory pool and take up all the available memory.

How to choose a memory pool implementation

To choose an implementation that suits your project, consider how you work with memory. Below I have provided several examples of applications for which different memory pools are suitable.

Without memory management

If you do not use memory intensively, you can avoid memory management altogether. In that case, the most you should do is analyze which allocations escape to the heap and try to minimize their number. Allocation on the stack is dozens of times faster than on the heap, so this approach is the most effective for optimizing performance. At the same time, you avoid unnecessary manipulations with sync.Pool and similar mechanisms.
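
The compiler reports its escape decisions if you build with go build -gcflags="-m". A minimal sketch; onStack and onHeap are illustrative names:

// go build -gcflags="-m" ./... prints escape decisions, for example
// "moved to heap: b" for onHeap below.

func onStack() int {
	x := 42 // stays on the stack: no reference outlives the function
	return x
}

func onHeap() *int {
	b := 42
	return &b // escapes: the pointer outlives the stack frame
}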

Channel Pool

If you rarely use memory, but want to use it rationally, choose Channel Pool.

Let's say you're writing an editor with a video frame buffer used for smooth animation. A memory pool organized on channels is a good choice for such a program, because you know the storage size in advance. And remember the important condition for a Channel Pool: you must know how many buffers to allocate.

Channel Pool also suits programs with an intermittent memory load: they experience peaks, but a lot of time passes between them. To avoid holding extra memory the whole time, use the channel pool only while the memory is needed, then drain the buffers from the channel and drop the channel itself; the memory then goes back to the Garbage Collector.
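
Draining such a pool between peaks is straightforward: empty the channel and drop the references, and the GC reclaims the buffers. A minimal sketch over the chanPool variable from above:

func drainPool() {
	for {
		select {
		case <-chanPool: // discard the pooled buffer
		default:
			return // channel is empty; the buffers are now garbage
		}
	}
}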

sync.Pool + memory in heap

However, there are situations when memory is allocated frequently and it is not known in advance how many buffers will be needed. In such cases, I recommend confidently using sync.Pool for memory management. It will not cause performance problems, and the program will run efficiently. The main condition is that memory is allocated constantly; otherwise the GC will empty the pool between uses.

sync.Pool + memory arena

Let's consider the application of this approach using a video editor as an example. The program shows the user a video sequence, but not in real time. This happens when the user goes frame by frame: edits something, fills the timeline with special effects, etc.

For such programs, sync.Pool on memory arena is suitable. I will emphasize again that memory arena is good for cases when we do not want Garbage Collector to pick up this memory. And sync.Pool saves us from trying to guess in advance how many buffers we need.

Memory arena is also suitable when we know exactly what size buffer to allocate. Let's say a 4K frame is always a 4K frame. No need to guess whether to allocate half a frame or a frame plus a “tail”. Therefore, memory arena is the most suitable option.

Individual approach

But there are cases that do not fall under any of the options considered; fortunately, there are not many of them. This is where an individual approach comes in.

Let's say the developers decide: sync.Pool and Channel Pool do not suit us, we will use a memory arena directly. In that case, you can "ask" the memory arena for 2 GB of memory and organize your own memory management on top of it. The main thing is to think through how to do it, but I am sure the optimization result will be just as good.
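
As a sketch of such an individual approach, here is a trivial bump allocator over one large preallocated slab. All names and sizes are illustrative, and there is no concurrency handling:

type Slab struct {
	buf []byte
	off int
}

func NewSlab(size int) *Slab { return &Slab{buf: make([]byte, size)} }

// Alloc hands out the next n bytes. The whole slab is released at once
// by resetting off, so the GC only ever sees one long-lived allocation.
func (s *Slab) Alloc(n int) []byte {
	if s.off+n > len(s.buf) {
		return nil // out of space; a real allocator would grow or fail loudly
	}
	p := s.buf[s.off : s.off+n : s.off+n]
	s.off += n
	return p
}

func (s *Slab) Reset() { s.off = 0 }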

Nuances to consider when developing your memory pool

  • The Garbage Collector cleans up buffers in sync.Pool that are not in use at the time it runs.

  • Uncontrolled growth of buffer sizes can lead to inefficient use of memory.

  • A memory leak is possible if you store copies of mutable objects instead of pointers.

  • A buffer can be returned even if it was handed off far away: just “wrap” the buffer pointer together with its PutBytes in a structure (see the sketch after this list).

  • Go runtime changes frequently: with each new version, check your chosen approach with benchmarks.
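
To illustrate the point about returning buffers from far away: a minimal sketch of a buffer that travels together with a way to return it, where Buffer, Release, and put are hypothetical names:

// Buffer travels with the data; distant code calls Release without
// knowing which pool (if any) the memory came from.
type Buffer struct {
	Data *[]byte
	put  func(*[]byte) // e.g. a pool's PutBytes
}

func (b *Buffer) Release() {
	if b.put != nil {
		b.put(b.Data)
	}
}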

Code repository where I compare memory pool implementations →
