Linux Kernel Configuration for GlusterFS

Questions come up periodically about Gluster’s recommendations for kernel tuning and whether such tuning is needed at all.

Such a need rarely arises: the kernel performs very well on most workloads. There is a downside, though. Historically, the Linux kernel readily consumes a lot of memory if given the opportunity, primarily for caching, which is its main way of improving performance.

Most of the time this works fine, but under heavy load it can lead to problems.

We have a lot of experience with memory-hungry systems such as CAD, EDA and the like, which started to slow down under high load, and at times we ran into problems with Gluster as well. After carefully watching memory usage and disk latency for more than a day, we would see disk overload, huge iowait, kernel oopses, hangs, and so on.

This article is the result of many tuning experiments performed in various situations. These settings not only improved overall responsiveness but also significantly stabilized cluster performance.

When it comes to memory tuning, the first thing to look at is the virtual memory (VM) subsystem, which has a large number of options that can be confusing.
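
If you want to see the current values before changing anything, the parameters discussed below can be read with the standard sysctl tool, for example:

sysctl vm.swappiness vm.vfs_cache_pressure vm.dirty_background_ratio vm.dirty_ratio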

vm.swappiness

The vm.swappiness parameter determines how readily the kernel uses swap compared with main memory. In the kernel source it is also described as the “tendency to steal mapped memory”. A high swappiness value means the kernel will be more inclined to swap out mapped pages; a low value means the opposite: the kernel will swap pages out of memory less. In other words, the higher vm.swappiness, the more the system will use swap.

Heavy swapping is undesirable, because huge blocks of data are moved in and out of RAM. Many people argue that the swappiness value should be high, but in my experience setting it to “0” improves performance.

You can read more details here – lwn.net/Articles/100978

But, again, these settings should be applied with caution and only after testing a specific application. For heavily loaded streaming applications this parameter should be set to “0”, which also improves system responsiveness.
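
A minimal sketch of applying the setting, assuming the standard sysctl tool (the value “0” is just the example discussed above):

sysctl -w vm.swappiness=0                     # apply at runtime
echo "vm.swappiness = 0" >> /etc/sysctl.conf  # keep it across reboots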

vm.vfs_cache_pressure

This parameter controls the kernel’s tendency to reclaim the memory used for caching directory and inode objects (the dentry and inode caches).

At the default value of 100, the kernel tries to reclaim the dentry and inode caches at a “fair” rate relative to the page cache and swap cache. Decreasing vfs_cache_pressure makes the kernel prefer to keep the dentry and inode caches. At 0 the kernel will never reclaim dentries and inodes due to memory pressure, which can easily lead to out-of-memory conditions. Raising vfs_cache_pressure above 100 makes the kernel prefer to reclaim dentries and inodes.

With GlusterFS, many users with large amounts of data and many small files can easily end up using a significant amount of RAM on the server due to inode/dentry caching, which degrades performance as the kernel has to crawl through huge data structures on a system with, say, 40 GB of memory. Setting this parameter above 100 has helped many users achieve fairer caching and better kernel responsiveness.
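
A minimal sketch of raising the value, assuming sysctl is available (120 is only an illustrative number; the text above says just “above 100”):

sysctl -w vm.vfs_cache_pressure=120   # reclaim dentries/inodes more readily than at the default of 100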

vm.dirty_background_ratio and vm.dirty_ratio

The first parameter, vm.dirty_background_ratio, defines the percentage of memory filled with dirty pages at which a background flush of those pages to disk must start. Until this percentage is reached, no pages are flushed. When the flush does start, it runs in the background without interrupting running processes.

The second parameter, vm.dirty_ratio, defines the percentage of memory that may be occupied by dirty pages before a forced flush begins. When this threshold is reached, all I/O becomes synchronous (blocking), and processes are not allowed to continue until the requested I/O has actually completed and the data is on disk. Under high I/O load this causes a problem: there is no more data caching, and every process doing I/O is blocked waiting for it. The result is a large number of frozen processes, high load, an unstable system and poor performance.

Decreasing the values of these parameters causes data to be flushed to disk more often and not accumulate in RAM. This helps memory-rich systems, for which it is otherwise normal to flush 45-90 GB of page cache to disk at once, causing huge latency for front-end applications and reducing overall responsiveness and interactivity.
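
A rough sketch of lowering both thresholds (the values 5 and 10 are illustrative assumptions, not figures from the text; pick them only after testing):

sysctl -w vm.dirty_background_ratio=5   # start background writeback at 5% dirty memory
sysctl -w vm.dirty_ratio=10             # block writers once 10% of memory is dirty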

echo "1" > /proc/sys/vm/pagecache

The page cache is a cache that stores the data of files and executable programs, that is, pages with the actual contents of files or block devices. This cache is used to reduce the number of disk reads. A value of “1” limits the cache to 1% of RAM, so more data will be read from disk than from RAM. It is usually not necessary to change this parameter, but if you are paranoid about controlling the page cache, you can use it.

echo "deadline" > /sys/block/sdc/queue/scheduler

The I/O scheduler is the Linux kernel component that handles read and write queues. In theory it is better to use “noop” with a smart RAID controller: Linux knows nothing about the physical geometry of the disks, so it is more efficient to let a controller that does know the geometry process requests as quickly as possible. In practice, however, deadline seems to improve performance. You can read more about schedulers in the documentation in the Linux kernel source: linux/Documentation/block/*iosched.txt. I have also seen read throughput increase during mixed operations (with many writes).
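
A quick sketch of checking the available schedulers before switching (sdc is simply the example device used above; note that newer multi-queue kernels list none / mq-deadline / bfq / kyber instead of noop / deadline / cfq):

cat /sys/block/sdc/queue/scheduler                 # the current scheduler is shown in square brackets
echo "deadline" > /sys/block/sdc/queue/scheduler   # switch to deadline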

echo "256" > /sys/block/sdc/queue/nr_requests

This is the number of I/O requests that are buffered before being passed to the I/O scheduler. Some controllers have an internal queue (queue_depth) that is larger than the scheduler’s nr_requests, so the I/O scheduler gets little chance to properly prioritize and merge requests. For the deadline and CFQ schedulers it is better when nr_requests is twice the controller’s internal queue depth. Merging and reordering requests helps the scheduler stay responsive under heavy load.
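
A sketch of comparing the two queue sizes before changing anything (sdc is still the example device; the queue_depth path applies to SCSI-like devices and may differ for other drivers):

cat /sys/block/sdc/device/queue_depth           # controller/LUN queue depth
cat /sys/block/sdc/queue/nr_requests            # queue size seen by the I/O scheduler
echo "256" > /sys/block/sdc/queue/nr_requests   # e.g. twice a queue depth of 128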

echo "4" > /proc/sys/vm/page-cluster

The page-cluster parameter controls how many pages are read in from swap in a single attempt. It is a logarithmic value: a setting of N means 2^N pages, so the value “4” above corresponds to 2^4 = 16 pages of 4 KB, matching the 64 KB stripe size of the RAID. This makes no sense with swappiness = 0, but if you set swappiness to 10 or 20, this value will help when the RAID stripe size is 64 KB.

blockdev --setra 4096 /dev/<devname> (e.g. sdb, hdc or dev_mapper)

The default block device settings for many RAID controllers often result in terrible performance. Adding the option above configures read-ahead of 4096 512-byte sectors, i.e. 2 MB. At least for streaming operations, speed is increased by filling the on-board disk cache with read-ahead during the time the kernel needs to prepare I/O. The cache may then already contain the data that will be requested by the next read. Too much read-ahead can kill random I/O on large files, since it uses up potentially useful disk time or loads data beyond the cache.
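
A small sketch for checking the current read-ahead before changing it (the device name is a placeholder, as above):

blockdev --getra /dev/<devname>        # current read-ahead, in 512-byte sectors
blockdev --setra 4096 /dev/<devname>   # 4096 * 512 B = 2 MB of read-ahead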

Below are a few more recommendations at the file system level, but they have not been tested. Make sure your filesystem knows the stripe size and the number of disks in the array, for example a RAID5 array with a 64K stripe and six disks (effectively five, since one disk’s worth of capacity goes to parity). These recommendations are based on theoretical assumptions and collected from various blogs and articles by RAID experts.

-> ext4 fs, 5 data disks, 64K stripe, units in 4K blocks
mkfs -t ext4 -E stride=$((64/4)) /dev/<devname>
-> xfs, 5 data disks, 64K stripe, units in 512-byte sectors
mkfs -t xfs -d sunit=$((64*2)) -d swidth=$((5*64*2)) /dev/<devname>

For large files, consider increasing the above stripe sizes.
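
To verify what the filesystem actually recorded, a rough sketch (device and mount point are placeholders; for ext4 the stripe_width extended option, stride multiplied by the number of data disks, is often passed alongside stride):

tune2fs -l /dev/<devname> | grep -i raid   # ext4: RAID stride / stripe width, if set
xfs_info /path/to/mountpoint               # xfs: sunit/swidth appear in the data section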

ATTENTION! Everything described above is highly subjective to particular types of applications. This article does not guarantee any improvements without prior testing of the relevant applications by the user. It should only be applied when it is necessary to improve overall system responsiveness, or when it solves current problems.
