A guide to finding and fixing memory leaks in Go services

How to detect a leak?

It’s simple: look at the graph of your service’s memory consumption in the monitoring system.

The server is bad
The same service running with different settings

All these pictures show a memory consumption graph that only grows over time and never decreases. The interval can be anything: look at the graph for a day, a week or a month.

How to set up monitoring?

We will test our service locally, without the help of DevOps engineers. We need git, Docker and a terminal, and we will deploy a Grafana + Prometheus stack.

Grafana is an interface for building beautiful graphs and charts and assembling them into dashboards. Prometheus is a system that includes a time series database and an agent that collects metrics from your services.

In order to quickly deploy all this on a local machine, we will use a ready-made solution – https://github.com/vegasbrianc/prometheus

$ git clone git@github.com:vegasbrianc/prometheus.git
$ cd prometheus
$ HOSTNAME=$(hostname) docker stack deploy -c docker-stack.yml prom

After launch, Grafana should open at http://<Host IP Address>:3000. Read the README in the repository for details.

Prometheus client

Now we need to teach our service to expose metrics; for this we need the Prometheus client library.

Sample code from the official repository

package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var addr = flag.String("listen-address", ":8080", "The address to listen on for HTTP requests.")

func main() {
	flag.Parse()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(*addr, nil))
}

The most important lines are the promhttp import and the handler registration; in 90% of cases they will be enough:

"github.com/prometheus/client_golang/prometheus/promhttp"
...
http.Handle("/metrics", promhttp.Handler())
...

After starting the service, check that data appears on the /metrics endpoint.
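
For example, with curl (output abbreviated; these are standard metrics from the default Go collector, the exact values will differ):

$ curl -s http://localhost:8080/metrics
...
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.1456e+06
...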

Add a job to the Prometheus agent

In the prometheus repository folder, find the prometheus.yml file. It has a scrape_configs section; add your job there:

scrape_configs:
  - job_name: 'my-service'
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.1.4:8080']

192.168.1.4 is the IP address of your local machine. You can find it with ipconfig or ifconfig; the interface is usually called en0:

$ ifconfig
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	options=6463<RXCSUM,TXCSUM,TSO4,TSO6,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM>
	ether f4:d4:88:7a:99:ce
	inet6 fe80::1413:b61f:c073:6a8e%en0 prefixlen 64 secured scopeid 0xe
	inet 192.168.1.4 netmask 0xffffff00 broadcast 192.168.1.255
	nd6 options=201<PERFORMNUD,DAD>
	media: autoselect
	status: active

Also, do not forget to change the listen address in your service. Right now it says ":8080"; to make sure Prometheus inside the Docker network can reach the service, bind explicitly to "192.168.1.4:8080":

var addr = flag.String("listen-address", "192.168.1.4:8080", "The address to listen on for HTTP requests.")

Why write an IP?

The thing is that the docker stack in this configuration runs Grafana and Prometheus in their own isolated network; they do not see your localhost, they have their own. You could, of course, run your service inside that Docker network as well, but pointing at the IP of the local interface is the easiest way to connect the containers running inside the Docker network with services running locally.

Applying settings

In the folder with the repository, execute the commands

$ docker stack rm prom
$ HOSTNAME=$(hostname) docker stack deploy -c docker-stack.yml prom

To make sure everything worked after the restart, you can look at the Prometheus targets page at http://localhost:9090/

Setting up Grafana

Everything is simple here: find a nice dashboard for Go on the official website https://grafana.com/grafana/dashboards/ and import it.

I personally liked this one https://grafana.com/grafana/dashboards/14061

Stress Testing

To detect a leak, the service needs to be loaded with real work. In web development, the main transport protocols between backend and frontend are HTTP and WebSocket. There are many load testing utilities for them, for example:

$ ab -n 10000 -kc 100 http://192.168.1.4:8080/endpoint
$ wrk -c 100 -d 10 -t 2 http://192.168.1.4:8080/endpoint

or JMeter, but we will use artillery.io, since we have WebSockets and I needed to reproduce a specific scenario in order to catch the memory leak.

For artillery you need Node.js and npm. Since I once programmed in Node.js, I like the volta.sh project: it is something like a virtual environment in Python, letting each project have its own version of Node.js and its utilities. But the choice is yours.

Install artillery:

$ npm install -g artillery@latest
$ artillery -v

Write a load testing script, test.yml:

config:
    target: "ws://192.168.1.4:8080/v1/ws"
    phases:
        # - duration: 60
        #   arrivalRate: 5
        # - duration: 120
        #   arrivalRate: 5
        #   rampTo: 50
        - duration: 600
          arrivalRate: 50
scenarios:
    - engine: "ws"
      name: "Get current state"
      flow:
        - think: 0.5

The active phase adds 50 virtual users every second for ten minutes. Each of them connects and “thinks” for 0.5 seconds; instead of thinking, it could just as well make one or several socket requests.
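
If just holding connections open is not enough, the artillery WebSocket engine can also send messages. A variant of the scenario might look like this (the JSON payload is a placeholder, substitute whatever your service actually expects):

scenarios:
    - engine: "ws"
      name: "Get current state"
      flow:
        - send: '{"action": "get_state"}'
        - think: 0.5
        - send: '{"action": "get_state"}'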

Launch and view charts in Grafana

$ artillery run test.yml

Healthy person charts

What you should pay attention to

In addition to memory, it is worth watching how your goroutines behave. Problems often come from a goroutine that stays “hanging”: all the memory allocated in it, and everything reachable through the pointers that got into it, remains as dead weight and is never freed by the garbage collector. You will see this on the chart as well. A banal example is a request handler that launches a goroutine for “heavy” calculations:

func (s *Service) Debug(w http.ResponseWriter, r *http.Request) {
  go func() { ... }() // nothing ever stops or waits for this goroutine
  w.WriteHeader(http.StatusOK)
  ...
}

This problem can be solved, for example, with context, sync.WaitGroup or errgroup:

func (s *Service) Debug(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithCancel(r.Context())
	defer cancel() // the goroutine stops as soon as the handler returns or the client disconnects
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			default:
			}
			...
		}
	}()
	w.WriteHeader(http.StatusOK)
	...
}
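
The errgroup variant looks similar. A minimal sketch, assuming golang.org/x/sync/errgroup is imported and with a placeholder loop instead of the real work:

func (s *Service) Debug(w http.ResponseWriter, r *http.Request) {
	// ctx is cancelled when the client disconnects or any goroutine returns an error
	g, ctx := errgroup.WithContext(r.Context())
	g.Go(func() error {
		for i := 0; i < 1000; i++ { // placeholder for the "heavy" calculation
			select {
			case <-ctx.Done():
				return ctx.Err() // the client went away, stop instead of leaking
			default:
			}
			// ... one step of the work
		}
		return nil
	})
	// the handler waits for its goroutine, so nothing keeps running behind its back
	if err := g.Wait(); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

The trade-off is that the handler now blocks until the work is done; a truly detached background job should instead be tied to the service lifetime and waited for during graceful shutdown.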

Or: when registering a client you allocate memory for it, for example for its session data, but do not release it when the client disconnects abnormally. Something like:

type clientID string
type session struct {
  role string
  refreshToken string
  ...
}
var clients map[clientID]*session
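
A minimal sketch of paired register/unregister (the Registry type and method names are mine, not from the original; it needs the sync package and assumes you can hook Unregister into every disconnect path, including abnormal ones):

type Registry struct {
	mu      sync.Mutex
	clients map[clientID]*session
}

func (r *Registry) Register(id clientID, s *session) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.clients[id] = s
}

// Unregister must run on every disconnect path, including timeouts and
// broken connections, otherwise sessions pile up in the map forever.
func (r *Registry) Unregister(id clientID) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.clients, id)
}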

Watch how you pass arguments to functions: by pointer or by value. Do you use global variables inside packages? Are pointers to data structures left hanging in channels and goroutines? Do you have a graceful shutdown?
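
On the graceful shutdown point, a minimal sketch using only the standard library (signal.NotifyContext is available since Go 1.16; the address, the timeout and the usual imports context, net/http, os/signal, syscall, time are assumptions here):

func run() error {
	// the context is cancelled on SIGINT or SIGTERM
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: ":8080"}
	go func() {
		<-ctx.Done()
		// give in-flight requests a few seconds to finish
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx)
	}()
	// ListenAndServe returns http.ErrServerClosed after Shutdown
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		return err
	}
	return nil
}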

It is impossible to give specific advice, every project is individual, but knowing how to set up monitoring yourself and track down a leak is a big step towards 99.9% stability.
