Advanced Redis Structures
In numbers, the system receives about 3,000,000 incoming metrics per second, of which about 20,000 metrics per second are saved. Every minute, 26,000 triggers are checked. On average, we send 2,000 notifications per day.
How to use Redis
Redis is a very popular DBMS, and you have probably encountered it in one way or another. Data in Redis is stored in a key-value format: there are no relational links or indexes. Also, all this data is stored in RAM, which ensures high performance. For these two reasons, Redis is most often used not as a full-fledged database, but as a cache for temporary storage of hot values.
However, when working with Redis, it is important not to forget another of its distinctive properties: it is single-threaded. This means that all data reads and writes happen strictly synchronously in one execution thread. Naturally, auxiliary threads are used to maintain the large number of network connections and offload I/O. But the read/write thread remains the main bottleneck.
To start using Redis in Go, it is enough to pick a library and initialize a client: Moira uses the go-redis module and its UniversalClient. The universal client's configuration hides many non-obvious nuances: for example, without the ReadOnly field set, reads from the slave nodes of a Redis cluster will not work, and properly configured timeouts and retries let you survive master re-elections without losses when one of the servers becomes unavailable.
import "github.com/go-redis/redis/v8"
c := redis.NewUniversalClient(&redis.UniversalOptions{
// Список адресов машин кластера
Addrs: config.Addrs,
Username: config.Username,
Password: config.Password,
// Для работы с Sentinel
MasterName: config.MasterName,
SentinelPassword: config.SentinelPassword,
SentinelUsername: config.SentinelUsername,
})
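For illustration, the fields mentioned above might be set roughly like this (a sketch; the timeout and retry values are made-up placeholders, not Moira's production settings):

c := redis.NewUniversalClient(&redis.UniversalOptions{
    Addrs: config.Addrs,
    // Allow read commands to be served by slave nodes of the cluster
    ReadOnly: true,
    // Timeouts and retries help survive master re-elections when a server goes down
    DialTimeout:  500 * time.Millisecond,
    ReadTimeout:  3 * time.Second,
    WriteTimeout: 3 * time.Second,
    MaxRetries:   5,
})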
The universal client allows the same code to work with different types of Redis installations:
– a single server, convenient for local testing;
– Redis Sentinel, with replication support;
– Redis Cluster, which supports both replication and sharding.
Under the hood, go-redis picks the client type from the options: if MasterName is set, it builds a failover (Sentinel) client; if several addresses are given, a cluster client; otherwise, a single-node client.
Basic Redis Commands
SET key value [EX seconds]
result, err := c.Set(ctx, key, value, ttl).Result()
Saves a string value under a key. You can optionally set a TTL: the time after which the key will be automatically deleted from the database.
GET key
result, err := c.Get(ctx, key).Result()
Retrieves a string value by key.
SADD key member
result, err := c.SAdd(ctx, key, member).Result()
Adds a value to the set stored at the key, creating the set if it does not exist yet. But SADD does not let you set a TTL on individual members: you have to remove values from sets yourself.
SREM key member [member…]
result, err := c.SRem(ctx, key, members...).Result()
Removes values from the set stored at the key. If the last value is removed, the key itself is removed as well.
A small lyrical digression. The following commands relate to ZSet, so let me briefly describe it. ZSet is a sorted set. In my opinion, it is ideal for selecting data by time range, for example, metrics. But it can also be used as a priority queue, for example, for notifications or events. And, importantly, ZSet is implemented not as a tree, as one might think, but as a hashmap with an additional skip list index.
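For example, a notification queue on top of a ZSet might look roughly like this (a sketch; the key name, fireAt, and payload are illustrative, not Moira's actual code):

// Schedule a notification: the score is the timestamp at which it should fire
err := c.ZAdd(ctx, "notifications", &redis.Z{Score: float64(fireAt.Unix()), Member: payload}).Err()
// Pop the earliest scheduled notification (ZPOPMIN requires Redis 5.0+)
popped, err := c.ZPopMin(ctx, "notifications", 1).Result()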
ZADD key score member
result, err := c.ZAdd(ctx, key, &redis.Z{Score: score, Member: member}).Result()
ZADD and ZREM work similarly to SADD and SREM, but with ordered sets.
ZRANGE key start stop [BYSCORE | BYLEX] [REV] [LIMIT offset count]
result, err := c.ZRangeByScore(ctx, key, &redis.ZRangeBy{
    Min:    strconv.FormatInt(from, 10),
    Max:    strconv.FormatInt(to, 10),
    Count:  int64(limitCount),
    Offset: int64(limitOffset),
}).Result()
Quickly selects values in a given range from a sorted set. You can additionally specify a count and offset of values to implement, for example, pagination.
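For instance, fetching the second page of one hundred metrics from the last hour might look like this (the time window and page size are illustrative):

now := time.Now().Unix()
page, err := c.ZRangeByScore(ctx, key, &redis.ZRangeBy{
    Min:    strconv.FormatInt(now-3600, 10),
    Max:    strconv.FormatInt(now, 10),
    Offset: 100, // skip the first page of 100 values
    Count:  100, // page size
}).Result()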
Pros and cons of communicating via Redis
Moira consists of dozens of microservices in Kubernetes that do not communicate with each other directly over the network. All communication goes through the Redis cluster.
Pros:
– The network inside the cluster can die, and we won't care.
– All intermediate states are saved in the database.
Cons:
– Every sneeze requires redeploying all the services.
– There are none of the relational database features without which it is difficult to migrate and clean up data.
Let me now tell you how we solve the problems of migration and data cleanup.
Data migration
Redis doesn't provide any special tools for data migration, so to migrate you have to take everything out and put it all back. Unfortunately, there is no easy way to guarantee that:
– no one will try to change the data at the very moment of migration;
– all services will wait for the migration to finish;
– the consistency of relationships between the data will be maintained during the migration.
And our data schema looks something like this:
So before each data migration, we ask ourselves one, but extremely important, question: "Do we REALLY need to migrate this data?"
If yes, then:
– we warn users;
– we enable Moira's read-only mode;
– we migrate the data.
If not, then we write code that is backward compatible with the current database schema.
For example, backward compatibility can be achieved like this. We have an old version of the trigger, in which our predecessors left an isRemote checkbox to indicate the source of metrics (at the time, Moira had only two metric sources: local and remote Graphite). We want to move to a field with an enumerated type, so that we can arbitrarily expand the set of supported sources in the future.
Before:
type Trigger struct {
    …
    IsRemote bool
}
What we want to get:
type Trigger struct {
    …
    TriggerSource TriggerSource
}
The first thing that comes to mind is to replace all the data in the database with a new version, with an enumerated type. In a relational database, this would be a single, atomically executed migration. However, in Redis, this will not work.
We use a more sophisticated approach. The trigger that is actually stored in the database is represented by a separate triggerStorageElement structure, in which we keep both fields. And in the abstraction layer above the database access code, we convert it into the Trigger structure used in the rest of the project, with the new field filled in with the correct value.
type triggerStorageElement struct {
    …
    IsRemote      bool
    TriggerSource moira.TriggerSource
}

func (se *triggerStorageElement) toTrigger() moira.Trigger {
    …
    se.TriggerSource.FillInIfNotSet(se.IsRemote)
    …
}
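For reference, FillInIfNotSet might look roughly like this (a sketch assuming TriggerSource is a string type with constants for the two legacy sources; the real implementation lives in the Moira repository):

func (s *TriggerSource) FillInIfNotSet(isRemote bool) {
    // Old triggers have no TriggerSource stored; derive it from the legacy flag
    if *s == "" {
        if isRemote {
            *s = GraphiteRemote
        } else {
            *s = GraphiteLocal
        }
    }
}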
Cleaning up obsolete data
Moira stores local metrics for one hour, both to avoid wasting memory and to discourage users from creating heavy alerts that request metrics over long time ranges.
We already discussed that a key can have an expire. So, in theory, we could set an hour-long expire and be done. The problem is that our metrics are stored as members inside sorted ZSets, and expire works only on whole keys, not on individual members, so it is not suitable here.
We could delete obsolete metrics at the moment we read them from the database. But it turns out this doesn't clean everything up either: there are situations when you stop writing metrics and stop reading them at the same time. And the old metrics live on.
Therefore, to solve this problem, you need a cron job that regularly cleans these metrics out of the database. In the end, the solution looks something like this:
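A minimal sketch of such a job (the function and key layout are illustrative, not Moira's actual code):

func cleanupOldMetrics(ctx context.Context, c redis.UniversalClient, metricKeys []string, maxAge time.Duration) error {
    // Everything older than the watermark is considered obsolete
    watermark := strconv.FormatInt(time.Now().Add(-maxAge).Unix(), 10)
    for _, key := range metricKeys {
        // Scores are timestamps, so a range removal drops all stale data points
        if err := c.ZRemRangeByScore(ctx, key, "-inf", watermark).Err(); err != nil {
            return err
        }
    }
    return nil
}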
We use this approach when we remove old metrics, tags, verification results, and old users.
After adding one of these cleanups, ZRange's execution time dropped dramatically:
The result exceeded all our expectations: not only did we shrink the keys, reducing the amount of data selected, we also removed stale data on which additional queries were being built. And Redis, due to its single-threaded nature, has a hard time digesting a large number of even small queries. Speaking of which…
Single-threading and Redis optimization
As mentioned above, Redis's biggest bottleneck is its single-threading. All read and write operations are performed synchronously in one thread. This solves many data consistency issues, but has a negative impact on performance. If you make a lot of requests, you'll get something like this:
We are using only 20% of the hardware's resources – there is room to grow! But on each machine one core is running at 80% while the others are resting.
Therefore, we use several approaches to optimization:
– Use servers with a small number of high-performance cores.
– Use Redis Cluster – the load can be distributed across several machines.
– Host masters and slaves on the same machine – this lets you utilize more resources (the bravest can host several masters on one machine).
– Collect requests into pipelines – you get fewer, larger requests instead of many small ones.
A pipeline lets you collect a set of requests into one big batch and send it to Redis in a single round trip. This is extremely efficient as long as the whole batch fits in memory. But if it doesn't, you can suddenly get an Out of Memory.
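In go-redis, that might look roughly like this (a sketch; a batch with Key, Timestamp, and Value fields is an illustrative structure, not Moira's actual one):

pipe := c.Pipeline()
for _, m := range batch {
    // Commands are only queued locally here; nothing is sent yet
    pipe.ZAdd(ctx, m.Key, &redis.Z{Score: float64(m.Timestamp), Member: m.Value})
}
// The whole batch is sent to Redis in one round trip
_, err := pipe.Exec(ctx)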
To distribute a gradually arriving flow of metrics into batches, Moira uses approximately the following logic:
batchTimer := time.NewTimer(timeout)
defer batchTimer.Stop()
for {
    batch := make(map[string]*moira.Metric)
retry:
    select {
    // Read metrics from the metrics channel
    case metric := <-metrics:
        AddMetric(batch, metric)
        // If the batch is not full yet, keep filling it
        if len(batch) < capacity {
            goto retry
        }
        // If it is full – send it off for processing
        batchedMetrics <- batch
    // If the batch is not full, but enough time has passed – send it for processing anyway
    case <-batchTimer.C:
        batchedMetrics <- batch
    }
    batchTimer.Reset(timeout)
}
Conclusion
That concludes the main simple secrets of using Redis in Moira's team. Many further performance improvements come down to choosing a good representation of the data in the database and writing convenient abstractions over it.
You can see how we did all this in the code repository – or even come and contribute!