Investigating phantom disk reads in Linux

Not so long ago, one of our users reported to us about a case of strange use of equipment. He, with the help of our client ILP (InfluxDB Line Protocol) was inserting rows into its QuestDB database, but along with disk writes, there were also significant disk reads. This is not expected from a write-only workload, so we needed to get to the bottom of the cause of this problem. Today we will share this story full of ups and downs and the magic of the Linux kernel.

▍ Issue: Unexpected write operations

As you may already know, QuestDB has a columnar storage format append-only data. In practice, this means that when the database writes a new row, it adds the values ​​of its column to a set of column files.

This provides a high level of disk writes due to the sequential write pattern. When handling a disk write-only load, you don’t expect the database to read a lot from disk. That is why this problem came as a surprise to us.

After a few attempts, we were able to recreate the problem. The symptoms looked like this iostat:

Here we see 10-70 MB / s writes to disk, while reading is as much as 40-50 MB / s. In situations where data is only being written, the use of read operations should be close to zero. Therefore, a high reading volume is very unexpected.

▍ Using Linux tools to investigate

First of all, we wanted to understand what specific files the database was reading. Fortunately, nothing is impossible in Linux. To collect information about all read operations of the selected disk, you can use the utility blktrace in combination with blkparse:

sudo blktrace -d /dev/nvme0n1 -a read -o - | blkparse -i -

The command shown above prints all disk read events on the system. Its output looks something like this:

259,0    7     4618     5.943408644 425548  Q  RA 536514808 + 8 [questdb-ilpwrit]
259,0    7     4619     5.943408967 425548  G  RA 536514808 + 8 [questdb-ilpwrit]
259,0    7     4620     5.943409085 425548  P   N [questdb-ilpwrit]
259,0    7     4621     5.943410717 425548  A  RA 536514832 + 8 <- (259,7) 6699280
259,0    7     4622     5.943410836 425548  Q  RA 536514832 + 8 [questdb-ilpwrit]
259,0    7     4623     5.943411176 425548  G  RA 536514832 + 8 [questdb-ilpwrit]
259,0    7     4624     5.943411707 425548 UT   N [questdb-ilpwrit] 2
259,0    7     4625     5.943411781 425548  I  RA 536514832 + 8 [questdb-ilpwrit]
259,0    7     4626     5.943411865 425548  I  RA 536514808 + 8 [questdb-ilpwrit]
259,0    7     4627     5.943414234   264  D  RA 536514832 + 8 [kworker/7:1H]
259,0    7     4628     5.943414565   264  D  RA 536514808 + 8 [kworker/7:1H]
259,0    7     4629     5.943472169 425550  C  RA 536514808 + 8 [0]
259,0    7     4630     5.943473856 425550  C  RA 536514832 + 8 [0]
... и множество других событий

Each line here stands for a separate event. For simplicity, consider the first line:

259,0    7     4618     5.943408644 425548  Q  RA 536514808 + 8 [questdb-ilpwrit]

We are interested in the following parts:

  • 425548 – this event was generated with pid 425548.
  • QRA – This event corresponds to a read request from a disk added to the I/O queue. The “A” suffix is ​​poorly documented, but is known to denote a potential readahead operation. What is “preemptive reading”, we will learn a little later.
  • 536514808+8 – This read operation starts at block 536514808 and has a size of 8 blocks.
  • [questdb-ilpwrit] – The operation was started by the ILP writer thread of the QuestDB database.

Armed with this knowledge, we can use debugfsto follow the block number and find the corresponding inode:

$ sudo debugfs -R 'icheck 536514808' /dev/nvme0n1
debugfs 1.46.5 (30-Dec-2021)
Block   Inode number
536514808   8270377

Finally, we can check what inode 8270377 stands for:

$ sudo debugfs -R 'ncheck 8270377' /dev/nvme0n1
debugfs 1.46.5 (30-Dec-2021)
Inode   Pathname
8270377 /home/ubuntu/.questdb/db/table_name/2022-10-04/symbol_col9.d.1092

These steps need to be done for every read event, but it’s easy to write a script to automate them. By examining the events, we found that the disk reads matched the column files. So, although the database only writes these files append-only, reads from disk somehow occur.

Another interesting fact is that the user has quite a few tables (about fifty), and each stores several hundred columns. So the database has to deal with a lot of column files. We were sure that our ILP code should only write to these files, not read from them.

Who can read these files? Maybe the operating system?

▍ The Linux Kernel Enters the Scene

Like many other databases, QuestDB uses buffered I/O to handle disk operations, for example, mmap, read And write. This means that when we write something to a file, the kernel writes the modified data to several pages in the page cache and marks it as dirty.

The page cache is a special transparent in-memory cache used by Linux to store data recently read from disk and recently modified data to be written to disk. The cached data is organized into pages, which are 4 KB in size on most distributions and CPU architectures.

In addition, it is not possible to limit the amount of RAM for the page cache, since the kernel tries to use all available RAM for it. Old pages are removed from the page cache so applications or the operating system can allocate new pages. In most cases, this happens transparently to the application.

With the standard value cairo.commit.mode QuestDB doesn’t make explicit calls fsync/msync to flush the column files to disk, so the kernel does all the flushing. So we should better understand what to expect from the kernel before hypothetically arguing about our situation with “phantom reads”.

As we already know, the operating system does not write file data modifications to disk immediately. Instead, it writes them to the page cache. This is called a write-back caching strategy. Lazy writing means that a background process is responsible for writing dirty pages to disk. On Linux it does pdflush – a set of kernel threads responsible for the delayed writing of “dirty” pages.

pdflush has many options to configure. Here are the most important ones:

  • dirty_background_ratio – if the percentage of “dirty” pages is less than this parameter, “dirty” pages remain in memory until they become obsolete. When the number of “dirty” pages exceeds this parameter, the kernel proactively runs pdflush. On Ubuntu and other Linux distributions, this setting is by default 10 (10%).
  • dirty_ratio – when the percentage of “dirty” pages exceeds this parameter, write operations cease to be asynchronous. This means that the writing process (in our case, the database process) will write pages to disk synchronously. When this happens, the corresponding thread is put into a “uninterruptible sleep” state (status code D in utility top). By default, this setting is set to 20 (20%).
  • dirty_expire_centisecs – This parameter defines the time in centiseconds after which they become too old for lazy writing. This setting usually defaults to 3000 (30 seconds).
  • dirty_writeback_centisecs – determines the wakeup interval for the pdflush process. This setting usually defaults to 500 (5 second).

The current values ​​of the above parameters can be checked through the virtual file system /proc:

$ cat /proc/sys/vm/dirty_background_ratio
$ cat /proc/sys/vm/dirty_ratio
$ cat /proc/sys/vm/dirty_expire_centisecs
$ cat /proc/sys/vm/dirty_writeback_centisecs

It’s also important to mention that the above percentages are calculated based on the total recoverable memory, not the total RAM available on the machine. If your application does not create a lot of “dirty” pages, and the amount of RAM is large, all writes to disk are performed pdflush asynchronously.

However, if the amount of memory available for the page cache is small, then pdflush will write data most often, with a high chance that the application will be put into a “uninterruptible sleep” state and blocked from writing.

Adjusting these parameters did not help us much. You remember that the user is writing to a large number of columns, right? This means that for processing extraordinary entries (out-of-order, O3) The database needs to allocate memory for each column, which leaves less memory for the page cache. We checked this first: indeed, the bulk of the RAM was being used by the database process. Changing a parameter cairo.o3.column.memory.size from the standard 16 MB to 256 KB helped to significantly reduce the frequency of reading from the disk, that is, the problem was somehow related to memory pressure. Don’t worry if you don’t understand this paragraph completely. Most importantly, reducing the memory used by the database has reduced the volume of write operations. This is a helpful hint.

But what was the reason for reading from the disk?

To answer this question, we need to better understand the disk-reading aspect of the page cache. To keep things simple, we’ll look at I/O based on mmap. Once you complete mmap for a file, the kernel allocates page table entries (PTEs) to virtual memory to reserve the address range for that file, but does not yet read the contents of the file. The data itself is read onto the page when you access the distributed memory, that is, start reading (command LOAD in x86 set) or write (command STORE in the x86 set) into memory.

When it first happens memory control unit (memory management unit, MMU) of your CPU signals a special event called “page fault”. Page fault means that the memory being accessed belongs to a PTE that has no physical memory allocated. The kernel handles page faults in two ways:

  • If the page cache already contains the corresponding file’s data in memory, such as belonging to the same file opened by another process, then the kernel simply updates the PTE to map to the existing page. This is called a minor page fault.
  • If the file data is not yet cached, the kernel must block the application, read the page, and only then update the PTE. This is called a major page fault.

As you might guess, major page faults are much more expensive than minor page faults, so Linux tends to minimize their number with an optimization called readahead (“read-ahead”; sometimes also called “fault-ahead” or “pre-fault”).

At a conceptual level, read-ahead is responsible for reading data that an application has explicitly requested. On the first access (read or write) to a few bytes of a newly opened file, a major page fault occurs, and the operating system reads the data corresponding to the requested page plus many pages before and after the file. This is called “read-around”. If you continue to access subsequent pages of the file, the kernel recognizes the sequential access pattern and starts prefetching, trying to read in advance a group of subsequent pages.

Because of this, Linux tries to optimize for sequential access patterns, as well as increase the chances of hitting an already cached page in case of random access.

Remember disk read event from output blktrace? The “A” suffix in the “RA” operation type indicates that the disk read was part of a read ahead. However, we know that the load was just reading, working with a large number of files. The problem is much more noticeable when there is not much memory left for page cache tasks.

What if pages are evicted from the page cache too early, causing excessive read-ahead on subsequent memory accesses?

We can test this hypothesis by disabling prefetch. To do this, it is enough to execute a system call madvise with flag MADV_RANDOM. This tells the kernel that the application will arbitrarily access the mmaped file, meaning that lookahead should be disabled for that file.

And the problem is solved! There were no more “phantom readings”:

▍ Happy ending

As a result, we have identified suboptimal kernel read-ahead behavior. Consuming a large number of column files with high memory pressure caused the kernel to start read-ahead disk operations that you wouldn’t expect from a write-only load. Then we just needed apply madvise in our code to disable read-ahead in the table writer.

What is the moral of the story? First, we couldn’t fix this problem if we didn’t know how buffered I/O works in Linux. When you know what to expect from the operating system and where to look, then selecting and using the right utilities becomes a trivial task. Secondly, any user complaints, even those that are difficult to reproduce, become an opportunity to learn something new and make QuestDB better.

Telegram channel with prize draws, IT news and posts about retro games 🕹️

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *