Bugs We Write, Find, and Fix #2

Hello! My name is Denis, and I am a developer of SEDMAX, industrial software for collecting and visualizing data in the energy sector. Like everyone else, we have bugs. I would like to share my experience in finding them, and also discuss what could have been done to prevent them from appearing in the first place. Our server side is written in Go as a set of services, so most of the bugs revolve around asynchronous interaction, and the example code is in Go.

The previous article made the following conclusions:

  1. It is worth separating asynchronous logic from business logic. Asynchronous code causes many problems both when writing it and when analyzing it, so it pays to identify the patterns that come up often in development and wrap them in a thoroughly tested library with a convenient API for the developer.

  2. Reflection sometimes slows execution down dramatically. As a rule this goes unnoticed at low load, but the cost grows quickly as the data gets larger. Prefer generics for generalized algorithms that depend only weakly on the type, and code generation for logic that is strongly tied to the properties of the type being passed.

  3. Do not use -1 as a sentinel code for an enumeration unless -1 is itself one of the enumeration's valid variants.

  4. Be very careful about starting goroutines inside handlers that other programs call, because those handlers can be invoked independently of each other, simultaneously, and many times.

In this part I will change the format a little: I will write less about how a bug manifested itself for the user and move straight to the bug itself.

No pre-allocation – get ready for slowdowns

This bug was described in the previous article and offered to readers as an exercise in spotting the problem. Unfortunately, nobody in the comments tried to say what was wrong. The following code slowed execution down dramatically.

// mergeData concatenates count parts from the map, in key order,
// into a single byte slice.
func mergeData(data map[int][]byte, count int) ([]byte, bool) {
	result := make([]byte, 0)
	for i := 0; i < count; i++ {
		part, ok := data[i]
		if !ok {
			return nil, false
		}
		result = append(result, part...)
	}
	return result, true
}

You can see that there is a condition inside the loop that exits the function early. When it fires, everything accumulated in the result slice so far is simply thrown away. But this exit condition can be checked before the loop even starts: if the map size is less than count, the early exit is guaranteed to happen, so there is no point in building the slice at all. In our case this function was called about 100 times, and every call except the last one collected data in vain. Moreover, the logs clearly showed that the more parts were passed to the function (their number kept growing), the longer it took to execute.

The second optimization is to pre-allocate the resulting slice. After all, its size is known in advance: the sum of the lengths of all parts in the map. There is a prealloc linter that can suggest places where pre-allocation is possible, but it only handles simple cases; in this example it stayed silent. The optimized version:

func mergeData(data map[int][]byte, count int) ([]byte, bool) {
	// If some parts are missing, the loop below is guaranteed to fail,
	// so exit before doing any work.
	if len(data) != count {
		return nil, false
	}
	// Pre-allocate the result: its final size is the sum of all parts.
	l := 0
	for _, v := range data {
		l += len(v)
	}
	result := make([]byte, 0, l)
	for i := 0; i < count; i++ {
		part, ok := data[i]
		if !ok {
			return nil, false
		}
		result = append(result, part...)
	}
	return result, true
}

The benchmark showed that the first optimization, the early check on the map size, sped things up noticeably: from about a minute to a couple of seconds. The effect of the second one is less dramatic.
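For reference, this kind of comparison is easy to reproduce with a standard Go benchmark. A minimal sketch, assuming mergeData lives in its own package and using a hypothetical helper with arbitrary sizes:

package merge

import "testing"

// buildTestData creates count parts of size bytes each
// (a helper for the benchmark only).
func buildTestData(count, size int) map[int][]byte {
	data := make(map[int][]byte, count)
	for i := 0; i < count; i++ {
		data[i] = make([]byte, size)
	}
	return data
}

func BenchmarkMergeData(b *testing.B) {
	data := buildTestData(1000, 1024)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, ok := mergeData(data, 1000); !ok {
			b.Fatal("mergeData unexpectedly failed")
		}
	}
}

Running it with go test -bench=. against the old and the new mergeData makes the difference visible without waiting for it to show up in production.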

Optimization via pre-allocation is a common thing that developers tend to forget about. For example, there is an open-source library, open62541, written in C with many stars on GitHub. We use it as the core of our OPC UA server. The library has a hash table that stores nodes; our server data is written into them, and third-party programs request it via the OPC UA protocol. Everything was fine for small configurations. But when the number of nodes grew to 1 million, it suddenly turned out that server initialization takes 14 hours, and the culprit is this storage. It has no method for pre-setting its size; the initial size is hard-coded to just 64 via a #define macro, which means it can only be changed at compile time. Sad. We have not yet decided what to do about this, but the problem is quite serious.
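By contrast, in Go the same lesson is cheap to apply, because maps accept a capacity hint at creation. A small illustrative sketch (the function and names are mine, unrelated to open62541):

func buildNodeIndex(names []string) map[string]int {
	// Pre-size the map when the number of entries is known up front;
	// this avoids repeated growth and rehashing as it fills up.
	index := make(map[string]int, len(names))
	for i, name := range names {
		index[name] = i
	}
	return index
}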

Sometimes it's worth looking into the implementation of the third-party libraries you use

The redundancy logic in SEDMAX is partly built on the exchange of messages between servers via NATS brokers. Each server has its own broker: a server sends messages to the neighboring server's NATS broker and listens to its own broker to coordinate actions. To begin, here is the piece of code that determines the addresses of these brokers.

func (r *Replicator) update(nodes []Node) {
	urls := struct{ own, reserve string }{}
	for _, n := range nodes {
		if n.GetAddr() == r.Addr {
			urls.own = n.GetNats()
		} else {
			urls.reserve = n.GetNats()
		}
	}
	...
}

These addresses are then used to connect to NATS and send messages. When a server receives a message from its neighbor, it must perform certain actions. The previous article already described the “bell” signal; in short, it is a signal telling services to refresh their internal caches. When a bell fires on one server, it is forwarded to the neighboring one. There it must be repeated locally but not sent back to the first server, otherwise the signal would loop.

The bug was that the loop did happen, and it put a real load on the system. Every cloud has a silver lining: thanks to this strange behavior we found the weak spots in the services that let this signal DDoS them, and fixed them. But back to the bug: how did this happen?

If you look closely at the code above, you can see that the fields of the urls struct are not guaranteed to be filled in; it is quite possible that one or both of them end up as empty strings. That is exactly what happened this time: urls.reserve turned out to be empty. This string is then used to create a NATS client. SEDMAX uses the go-micro framework, a library that simplifies writing Go services. It turned out that client creation happily accepts an empty string. We expected an empty string to be invalid, so that the created client would return an error on send. But that is not the case: somewhere in the depths of the library the empty string is replaced with localhost. This was not at all obvious, since the documentation does not mention such behavior anywhere; it is hidden deep inside, behind interfaces and asynchronous code, and getting to that piece of logic was difficult.

To sum up: we got an empty string as the URL of the backup server. When sending messages, the empty string was replaced by localhost, which means the messages were sent back to the sender itself, and that is what looped our bell. Not everyone studies the implementation of the libraries they use, and doing so is expensive. The library developers should describe this behavior in the API documentation, because it is genuinely important information.
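A cheap way to defend against this kind of surprise is to validate the addresses ourselves before handing them to the library. A minimal sketch of the idea on top of the update method shown earlier, using the standard errors package; the error return and message are my assumption, not the actual SEDMAX fix:

func (r *Replicator) update(nodes []Node) error {
	urls := struct{ own, reserve string }{}
	for _, n := range nodes {
		if n.GetAddr() == r.Addr {
			urls.own = n.GetNats()
		} else {
			urls.reserve = n.GetNats()
		}
	}
	// An unfilled URL is a configuration error; do not let the client
	// silently turn it into localhost.
	if urls.own == "" || urls.reserve == "" {
		return errors.New("replicator: missing NATS broker address")
	}
	...
}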

Synchronization is not complete

When working with data concurrently, it must be protected with synchronization primitives. Go even has a built-in concurrent-access check for maps: the runtime crashes the program if a map is read and written at the same time. So everyone is used to the fact that if a map is accessed from several goroutines, it has to be protected with a mutex, and that is what the go-micro framework we use does. For example, to request data from a service via gRPC, you only need to know its name and have a client with the necessary methods, which are generated from proto files. For us, as users of the library, a remote call looks like an ordinary method call, but a lot of interesting things happen inside.

First, you need to find a running instance (node) of the service by name. For this there is an entity called the registry. It stores the list of nodes in a map keyed by service name. When a service starts, it registers itself, passing the IP and port it can be reached at. All services receive this information and save it in their registry. When a service shuts down, it deregisters, and all services delete that node from their registries. Then, knowing the IP and port of a specific running node, the gRPC request is made. In reality it is all a bit more complicated, but the bug was in the registry, so I have only briefly described that part.
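To make the rest of the story easier to follow, here is a heavily simplified sketch of such a registry. The types and names are illustrative, not go-micro's actual code:

type serviceNode struct {
	ID      string
	Address string
}

type registry struct {
	mu    sync.RWMutex
	nodes map[string][]*serviceNode // service name -> running nodes
}

// GetService returns the nodes registered under the given name.
func (r *registry) GetService(name string) []*serviceNode {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.nodes[name] // only the slice header is copied, the array stays shared
}

// Deregister removes one node of a service; it runs in another goroutine.
func (r *registry) Deregister(name, id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	nodes := r.nodes[name]
	for i, n := range nodes {
		if n.ID == id {
			// append shifts the remaining elements inside the same
			// backing array that earlier callers may still be reading
			r.nodes[name] = append(nodes[:i], nodes[i+1:]...)
			return
		}
	}
}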

In the registry, the map with nodes is protected by a mutex, since it is accessed from the many goroutines that make remote calls. For each service name the map holds a slice of that service's nodes, and this slice is what we get back. When a service deregisters, another goroutine removes its node from that slice. This is where the bug lived, and it is quite hard to notice and manifests itself rarely. In Go, a slice is a reference-like structure: it stores a pointer to the underlying array, so if the data behind that pointer changes, the change is visible through every slice that shares it. The sequence of the bug is as follows:

  1. We request the list of nodes.

  2. The mutex protecting the map is locked.

  3. The value (the slice of nodes) is handed back to us to work with.

  4. The mutex is released.

  5. We work with the received slice.

  6. In parallel with step 5, a deregistration signal arrives in another goroutine: the mutex protecting the map is locked, the node is removed from the slice stored under that key, and the mutex is released.

So in steps 5 and 6 different goroutines end up working with the same memory, which led to unexpected consequences: the service simply hung, because it accessed a node that had already been deleted.
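The slice semantics behind this are easy to reproduce in isolation. A toy example, unrelated to go-micro:

package main

import "fmt"

func main() {
	a := []int{1, 2, 3}
	b := a            // copies only the slice header, the array is shared
	b[0] = 42
	fmt.Println(a[0]) // prints 42: the write is visible through both slices
}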

To the developers' credit, this bug has already been fixed in later versions of the library. The slice has to be copied before the mutex is released in step 4, so that different goroutines never touch the same memory region.
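In terms of the sketch above, the fix looks roughly like this: the copy is made while the lock is still held, so the caller never shares backing memory with the registry.

// GetService after the fix: return a copy instead of the shared slice.
func (r *registry) GetService(name string) []*serviceNode {
	r.mu.RLock()
	defer r.mu.RUnlock()
	nodes := r.nodes[name]
	cp := make([]*serviceNode, len(nodes))
	copy(cp, nodes)
	return cp
}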

To catch such bugs at the testing stage, you need to run the tests with the -race flag. The race detector then does a job similar to the built-in map check, but for all memory accesses. Since this is a third-party library, that code was not covered by such tests in SEDMAX.
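For a whole Go module this is a single command:

go test -race ./...

The detector adds noticeable runtime overhead, so it is usually enabled in tests and CI rather than in production builds.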

Conclusions

  1. It is better not to make unnecessary allocations or reallocations. On small volumes of data this goes unnoticed, but under heavy load it comes back to haunt you.

  2. When using third-party libraries, it's best to make sure they work as expected for all argument variations.

  3. Run tests with the -race flag to find data races.

The next article will be about how we fixed bugs caused by heavy load on the InfluxDB time-series database. Not much has been written about how this database behaves as the load grows, so we had to step on every rake ourselves.
