A short story about fio and etcd
The performance of an etcd cluster depends largely on the performance of its storage. etcd exports some storage-related metrics to Prometheus; one of them is wal_fsync_duration_seconds. The etcd documentation says that for storage to be considered fast enough, the 99th percentile of this metric must be less than 10 ms. If you plan to run an etcd cluster on Linux machines and want to evaluate whether your storage (for example, an SSD) is fast enough, you can use fio, a popular tool for testing I/O operations. Run the following command, where test-data is a directory under the mount point of the storage being tested:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
Then just look at the results and check that the 99th percentile of the fdatasync duration is less than 10 ms. If it is, your storage is fast enough. Here is an example of the results:
  sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
  sync percentiles (usec):
   |  1.00th=[  553],  5.00th=[  578], 10.00th=[  594], 20.00th=[  627],
   | 30.00th=[  709], 40.00th=[  750], 50.00th=[  783], 60.00th=[ 1549],
   | 70.00th=[ 1729], 80.00th=[ 1991], 90.00th=[ 2180], 95.00th=[ 2278],
   | 99.00th=[ 2376], 99.50th=[ 9634], 99.90th=, 99.95th=,
   | 99.99th=
- We configured the values of the --size and --bs parameters for our specific scenario. To get a useful result from fio, substitute your own values. Where do you get them? Read on to see how we learned to tune fio.
- During the test, all of the I/O load comes from fio. In a real scenario, the storage will most likely receive other write requests besides the WAL writes, and that additional load will increase the value of wal_fsync_duration_seconds. So if the 99th percentile in the fio test is already close to 10 ms, your storage does not have enough headroom.
- Use fio version 3.5 or later (earlier versions do not report fdatasync duration percentiles).
- The output above is only a fragment of the full fio results.
Long story about fio and etcd
What is WAL in etcd
Databases typically use a write-ahead log (WAL), and etcd is no exception. We will not discuss the WAL in detail here; we just need to know that each member of an etcd cluster keeps one in persistent storage. etcd records each key-value operation (for example, an update) in the WAL before applying it to the store. If one of the members crashes and restarts between snapshots, it can locally recover the transactions performed since the last snapshot from the contents of the WAL.
When a client adds a key to the key-value store or updates the value of an existing key, etcd records the operation in the WAL, which is a regular file in persistent storage. Before proceeding any further, etcd MUST be completely sure that the WAL entry has actually been persisted. On Linux, a single write system call is not enough, because the actual write to physical storage may be delayed. For example, Linux may keep the WAL entry for some time in an in-kernel cache (such as the page cache). For the data to be reliably written to persistent storage, the write must be followed by the fdatasync system call, and that is exactly what etcd does (as the strace output shows, where 8 is the WAL file descriptor):
21:23:09.894875 lseek(8, 0, SEEK_CUR)   = 12808 <0.000012>
21:23:09.894911 write(8, "\10\0\0\0\0\0\0\0\202\10\2\20\361\223\255\266\6\32$\10\0\20\10\30\26\"\34\"\r\n\3fo"..., 2296) = 2296 <0.000130>
21:23:09.895041 fdatasync(8)            = 0 <0.008314>
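This write-then-fdatasync durability pattern is easy to reproduce. Below is a minimal Python sketch of it; the file name and record size here are illustrative choices, not taken from etcd itself:

```python
import os
import time

# Sketch of etcd's WAL append pattern: write() alone may leave the data
# in the kernel page cache, so fdatasync() is issued before the entry is
# considered durable. Path and record size are illustrative only.
path = "wal-demo.bin"
record = b"\x00" * 2300  # roughly the entry size we discuss later

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
try:
    start = time.monotonic()
    os.write(fd, record)   # may only reach the page cache
    os.fdatasync(fd)       # forces the data down to the device
    elapsed = time.monotonic() - start
    print(f"write+fdatasync took {elapsed * 1000:.3f} ms")
finally:
    os.close(fd)
os.remove(path)
```

Timing exactly this pair of system calls, at scale, is what the fio test in this post does for us.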
Unfortunately, writing to persistent storage does not happen instantly. If the fdatasync call is slow, etcd's performance drops. The etcd documentation states that storage is considered fast enough if, at the 99th percentile, the fdatasync calls writing to the WAL file take less than 10 ms. There are other useful storage metrics, but in this post we talk only about this one.
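To make the 10 ms rule concrete, here is a small sketch of how a 99th-percentile check can be computed from raw fdatasync durations. The sample values (in microseconds) are lifted from the fio output fragment shown earlier, purely for illustration:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # integer ceil
    return ordered[rank - 1]

# fdatasync durations in microseconds (values from the fio example above)
durations_us = [534, 578, 594, 627, 709, 750, 783, 1549, 1729, 1991,
                2180, 2278, 2376, 9634]

p99_us = percentile(durations_us, 99)
print(p99_us, "usec; fast enough:", p99_us < 10_000)  # 9634 usec; True
```

Note that fio and Prometheus each use their own percentile estimators; this nearest-rank version is just the simplest way to express the idea.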
Evaluating storage with fio
If you need to assess whether your storage is suitable for etcd, use fio, a very popular I/O load testing tool. Keep in mind that disk operations can vary a lot: synchronous and asynchronous, many classes of system calls, and so on. As a result, fio is quite difficult to use: it has many parameters, and different combinations of their values produce completely different I/O workloads. To get numbers that are meaningful for etcd, you should make sure that the write load generated by fio is as close as possible to the actual load etcd generates when writing WAL files.
Therefore, fio should, at a minimum, generate a load consisting of a series of sequential writes to a file, where each write is a write system call followed by an fdatasync system call. For sequential writes, fio needs the --rw=write parameter. To make fio use the write system call instead of pwrite, specify the --ioengine=sync parameter. Finally, to have fdatasync called after each write, add the --fdatasync=1 parameter. The other two parameters in this example (--size and --bs) depend on your specific scenario. In the next section we explain how to configure them.
Why fio, and how we learned to tune it
In this post we describe a real case. We had a Kubernetes v1.13 cluster that we monitored with Prometheus, with etcd v3.2.24 hosted on SSDs. The etcd metrics showed fdatasync latencies that were too high, even when the cluster was idle. The metrics looked odd, and we did not really know what they meant. The cluster consisted of virtual machines, so we needed to understand where the problem was: in the physical SSDs or in the virtualization layer. In addition, we often made changes to the hardware and software configuration and needed a way to evaluate their effect. We could have run etcd in each configuration and looked at the Prometheus metrics, but that would have been too cumbersome. We were looking for a reasonably simple way to evaluate a specific configuration and to check whether we understood the Prometheus metrics coming from etcd.
For that we had to solve two problems. First, what does the I/O load that etcd creates when writing to the WAL look like? Which system calls are used? What is the size of the writes? Second, once those questions were answered, how would we reproduce a similar workload with fio? Remember that fio is a very flexible tool with many options. We solved both problems with the same approach, based on the lsof and strace commands. lsof displays all file descriptors used by a process and the files associated with them. strace can examine an already running process, or start a process and examine it; it displays all system calls made by the process under study and by its child processes. The latter is important, because etcd spawns child processes.
First of all, we used strace to examine the etcd server of a Kubernetes cluster under no load. We saw that almost all WAL entries were about the same size: 2200–2400 bytes. That is why the command at the beginning of this post specifies the parameter --bs=2300 (bs is the size in bytes of each fio write). Note that the size of an etcd entry depends on the etcd version, the distribution, the parameter values, and so on, and it affects the duration of fdatasync. If you have a similar scenario, examine your own etcd processes with strace to find out your exact numbers.
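As a rough illustration of how such a survey can be automated, here is a sketch that pulls the requested byte counts out of strace write lines. The trace lines are abbreviated samples, and the second size value is made up for the example:

```python
import re

# Abbreviated strace lines; fd 8 is assumed to be the WAL file, and the
# second line's byte count is invented for illustration.
trace = """\
21:23:09.894911 write(8, "..."..., 2296) = 2296 <0.000130>
21:23:10.101200 write(8, "..."..., 2312) = 2312 <0.000121>
"""

# write(fd, buf, count) = ret  -> capture the requested byte count
sizes = [int(m.group(1)) for m in re.finditer(r'write\(8, .*?, (\d+)\)', trace)]
avg = sum(sizes) / len(sizes)
print(sizes, f"avg={avg:.0f}")
```

In our case the averages over a real trace pointed at --bs=2300; with your own trace the number may well differ.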
Then, to get a complete picture of etcd's file-system activity, we launched it under strace with the -ff, -tt, and -T options. This let us trace the child processes and write each one's output to a separate file, and also get detailed reports on the start time and duration of each system call. We used lsof to confirm our analysis of the strace output and to see which file descriptor was being used for which purpose. That is how we obtained the strace results shown above. The synchronization time statistics confirmed that the wal_fsync_duration_seconds values reported by etcd correspond to the fdatasync calls on the WAL file descriptors.
We studied the fio documentation and chose parameters for our scenario so that fio would generate a load similar to etcd's. We also verified the system calls and their durations by running fio under strace, just as we had done with etcd.
We chose the value of the --size parameter, which represents the total I/O load from fio, carefully. In our case it is the total number of bytes written to storage, and it is directly proportional to the number of write (and fdatasync) system calls: for a given bs, the number of fdatasync calls is size / bs. Since we were interested in a percentile, we needed enough samples for it to be statistically meaningful, and we calculated that 10^4 would be enough for us (which works out to 22 mebibytes). With a smaller --size, outliers (for example, a few fdatasync calls that take longer than usual) could skew the 99th percentile.
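The arithmetic behind these choices can be sketched as follows; it is just the size / bs relationship described above, written out:

```python
# With one fdatasync per --bs-sized write, the number of latency samples
# fio collects is size / bs, and we wanted at least 10^4 of them.
bs = 2300                         # bytes per write, from the strace survey
target_samples = 10_000

min_size = bs * target_samples    # minimum bytes needed: 23,000,000
size = 22 * 1024 * 1024           # fio's "22m" means 22 MiB
samples = size // bs              # samples actually collected

print(f"need >= {min_size} bytes; 22 MiB gives {samples} samples")
```

22 MiB is 23,068,672 bytes, so it clears the 23,000,000-byte minimum with a little to spare.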
Try it yourself
We showed how to use fio to find out whether your storage is fast enough for high-performance etcd. Now you can try it yourself, for example using virtual machines with SSD storage in the IBM Cloud.