How a small “tuning” of Talos Linux increased the performance of NVMe SSDs by 2.5 times

Background

I recently started preparing another Kubernetes cluster on bare-metal servers for one of our projects, the goal being to move away from Google Cloud and cut infrastructure costs by roughly 4x while getting about 4x more vCPU/RAM/SSD resources (and the performance of network drives in the clouds leaves much to be desired).

As the OS I decided to go with my beloved Talos Linux, which makes it very easy to deploy a Kubernetes cluster in any environment, to update its components, and to keep the configuration identical on all nodes of the cluster thanks to its declarativeness and immutability.

I decided to deploy all of this at Hetzner on Dell PowerEdge R6615 servers (the DX182 line). The configuration of each server looks like this:

  • 1x AMD EPYC™ GENOA 9454P (96 vCPU, Zen 4)

  • 384GB DDR5 ECC RAM (12x32GB)

  • 2×960GB SATA SSD in RAID1 behind a Dell PERC H755 controller. Talos Linux was installed directly onto it (the OS does not support mdadm).

  • 2×7.68TB U.2 PCIe NVMe SSD Samsung PM9A3 – for Linstor storage in Kubernetes

  • 2x10G NIC (for 10G Internet uplink)

  • 2x25G NIC (for private network, combined using LACP bonding)

Hetzner also has the AX162 line with the same AMD EPYC 9454P processor; it is significantly cheaper, but it is built from cheaper components.

For example, those servers use an ASRock Rack motherboard whose BMC interface is far worse than the iDRAC in Dell servers. Incidentally, for some reason Hetzner does not provide access to the ASRock Rack BMC at all, whereas access to iDRAC is available on request. There is also a thread on Reddit discussing stability problems with servers of this line, so think twice before renting them.

Also, unlike the Dell machines, the AX162 servers have no power redundancy, since they come with only a single power supply.

AX162 line server

And this is what the selected technology stack for the Kubernetes cluster looks like:

  • Cilium as a CNI with eBPF, kube-proxy replacement, native routing, BBR, XDP and other goodies enabled

  • Linstor+DRBD in async mode for storage on top of LVM Stripe.

  • VictoriaMetrics K8S Stack for monitoring

  • VictoriaLogs for collecting, storing and analyzing logs

  • FluxCD to declaratively manage all this using the GitOps approach.

Method for testing NVMe SSD drives

Before launching a cluster into production, it is worth testing it thoroughly; that includes not only failover tests to understand how the cluster behaves in various situations, but also performance benchmarks of the CPU, memory, network and disks. This will help us later understand how updates to the OS, kernel and other components affect performance over time. And in general, it doesn't hurt to know what our hardware is capable of, especially compared to the clouds. It is the disks that we will focus on in this article.

By the way, before these tests I updated the firmware of the NVMe SSD drives, as well as all other components, including the BMC, BIOS, network cards and others. I also changed the sector size on disks from 512 bytes to 4 kilobytes.
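For reference, this kind of sector-size change can be done with nvme-cli; a minimal sketch, assuming the namespace used below and that one of the drive's supported LBA formats is 4K (the format index is device-specific, and the operation destroys all data):

# List the LBA formats the drive supports and find the 4K one (the index varies by model)
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

# Destructive: reformat the namespace with the chosen 4K LBA format (index 1 is an assumption)
nvme format /dev/nvme0n1 --lbaf=1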

As a testing method, I chose, in my opinion, a fairly objective set of tests that is representative of most typical workloads in modern clusters (DBMSs such as Postgres and ClickHouse, S3 object storage, etc.):

fio -name=randwrite_fsync -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=randwrite -bs=4k -numjobs=1 -iodepth=1 -fsync=1
fio -name=randwrite_jobs4 -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=randwrite -bs=4k -numjobs=4 -iodepth=128 -group_reporting
fio -name=randwrite -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=randwrite -bs=4k -numjobs=1 -iodepth=128
fio -name=write -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=write -bs=4M -numjobs=1 -iodepth=16
fio -name=randread_fsync -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=randread -bs=4k -numjobs=1 -iodepth=1 -fsync=1
fio -name=randread_jobs4 -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=randread -bs=4k -numjobs=4 -iodepth=128 -group_reporting
fio -name=randread -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=randread -bs=4k -numjobs=1 -iodepth=128
fio -name=read -filename=/dev/nvme0n1 -output-format=json -ioengine=libaio -direct=1 -runtime=60 -randrepeat=0 -rw=read -bs=4M -numjobs=1 -iodepth=16

I took this set of tests from here. Incidentally, that is a rather interesting article by @vitalif; I recommend reading it.

Please note that the fio tests are run directly against the block device /dev/nvme0n1 to eliminate the influence of the file system on the results. Also, before each individual fio run, blkdiscard -f /dev/nvme0n1 was executed to exclude any influence of the previous test on the next one.
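In other words, each iteration of the suite looks roughly like this (a minimal sketch showing one of the tests above; the full run repeats it for all eight jobs):

DEV=/dev/nvme0n1

# Drop all blocks so the previous test does not influence this one
blkdiscard -f "$DEV"

# Run a single test from the suite above
fio -name=randwrite_fsync -filename="$DEV" -output-format=json -ioengine=libaio -direct=1 \
    -runtime=60 -randrepeat=0 -rw=randwrite -bs=4k -numjobs=1 -iodepth=1 -fsync=1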

I have prepared a convenient Docker image, maxpain/fio:3.38, which includes the latest version of fio as well as the test suite above in the /run.sh script, which I kindly borrowed from @kvaps and slightly modified. It runs for 8 minutes and produces a CSV line that can easily be pasted into a Google spreadsheet like this one, immediately giving visual graphs and a complete picture of bandwidth/IOPS/latency.

An example of running it in Docker:

docker run -it --rm --entrypoint /run.sh --privileged maxpain/fio:latest /dev/nvme0n1

An example of running it in Kubernetes (after the pods start, exec into one of them with kubectl exec):

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: fio-test
  namespace: debug
spec:
  selector:
    matchLabels:
      app: fio-test
  template:
    metadata:
      labels:
        app: fio-test
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: fio
          image: maxpain/fio
          command:
            - sleep
            - infinity
          securityContext:
            privileged: true
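Once the DaemonSet is running, the test itself might be launched like this (the pod name is a placeholder; pick the pod scheduled on the node you want to test):

kubectl -n debug get pods -l app=fio-test -o wide
kubectl -n debug exec -it fio-test-xxxxx -- /run.sh /dev/nvme0n1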

Testing on different OSes

I originally wanted to measure the impact of DRBD on top of LVM on I/O performance compared to a raw disk, but I fooled myself by running the tests straight away on Talos Linux and taking the results at face value. At some point, comparing my numbers with the results of @kvaps's testing on much slower and older servers, I was surprised to find that mine were significantly worse.

I decided to test raw-disk performance on Debian 12 by compiling the same LTS Linux 6.6.54 kernel that is used in Talos Linux v1.8.1, and this is what I got:

Just in case, I compared various system parameters such as the I/O scheduler and the CPU governor (set to performance on both systems), but they were all identical.
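Roughly speaking, these are the values that were compared (the device and CPU paths are examples):

# Active I/O scheduler for the NVMe device and the current CPU frequency governor
cat /sys/block/nvme0n1/queue/scheduler
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor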

What kind of sorcery is this? Clearly it had to be the kernel configuration. I compiled the Linux kernel for Debian using the kernel config from Talos Linux and got a performance degradation similar to what I had seen on Talos Linux itself!

IOMMU

Searching a kernel config for the parameter that kills performance this badly is like looking for a needle in a haystack: a config has more than 6,000 lines, and here there are two of them to compare. How do you tell which parameter is responsible? And what if several parameters are involved?

The scripts/diffconfig script from the Linux kernel source tree helped me solve this difficult problem: it takes two configs as input and produces a diff like this:

-AMD_XGBE_HAVE_ECC y
 NUMA_BALANCING y -> n
 NUMA_EMU y -> n
 NVIDIA_WMI_EC_BACKLIGHT m -> n
 NVME_AUTH n -> y
 NVME_CORE m -> y
+IMA_LOAD_X509 n
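Invoking it is straightforward; a sketch, with the config paths being placeholders:

# Run from the kernel source tree; the arguments are the two configs to compare
./scripts/diffconfig /path/to/config-talos /path/to/config-debian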

The resulting diff turned out to be quite large, and it was still impossible to find the ill-fated parameter manually.

ChatGPT came to the rescue, namely the new o1-preview model, which happily swallowed the huge diff mentioned above and, to my surprise, pointed out the problematic parameter on the first try:

IOMMU default settings:

Talos Linux has CONFIG_IOMMU_DEFAULT_DMA_STRICT=y enabled, while Debian uses CONFIG_IOMMU_DEFAULT_DMA_LAZY=y. Strict mode makes the kernel flush IOMMU caches immediately on every DMA map and unmap (that is, on every I/O), which creates additional load on the system and can significantly reduce I/O performance under intensive operations such as IOPS testing.

Recommended action: change CONFIG_IOMMU_DEFAULT_DMA_STRICT=y to CONFIG_IOMMU_DEFAULT_DMA_LAZY=y in the Talos Linux kernel configuration to match the Debian setup and reduce DMA overhead.

However, you do not have to rebuild the kernel: you can simply pass the kernel argument iommu.strict=0 instead:

config IOMMU_DEFAULT_DMA_LAZY
  bool "Translated - Lazy"
  help
    Trusted devices use translation to restrict their access to only
    DMA-mapped pages, but with "lazy" batched TLB invalidation. This
    mode allows higher performance with some IOMMUs due to reduced TLB
    flushing, but at the cost of reduced isolation since devices may be
    able to access memory for some time after it has been unmapped.
    Equivalent to passing "iommu.passthrough=0 iommu.strict=0" on the
    command line.
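To verify which mode is actually in effect after boot, something like this works (the exact dmesg wording varies between kernel versions; on Talos the equivalent of the second command is talosctl dmesg):

# Confirm the argument made it onto the kernel command line and inspect the IOMMU boot messages
cat /proc/cmdline
dmesg | grep -i iommu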

And yes, this setting actually brought I/O performance on Talos Linux back to Debian levels!

By the way, Talos is not the only OS to run into this problem. For example, judging by this bug report, in Ubuntu 24.04 the same setting caused a 3.5x performance drop for some people – not on disks, though, but on the network:

iommu.strict=1 (strict): 233464.914 TPS
iommu.strict=0 (lazy): 835123.193 TPS

Something tells me that GPUs in AI/ML clusters will also have problems because of this setting.

Talos Linux developers promptly changed the default value of this parameter, so that in new versions of this OS it will not be necessary to pass iommu.strict=0 to fix performance issues.

Well, I found this problem fairly quickly and easily, but that was not the end of the torment.

Security patches and spec_rstack_overflow

I was curious how the performance of NVMe SSD drives is affected by the version of the Linux kernel, so I compiled all the kernels from Linux 6.1 to 6.11, and this is what I got:

How is this possible? Why do 6.2 and 6.3 deliver twice the IOPS? And why does 6.1 show the same values as 6.4 and newer kernels? I couldn't think of anything better than git bisect, and after compiling about 15 more kernel variants I found the ill-fated commit.
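The bisect itself is nothing exotic; a rough sketch, assuming v6.3 is the last "fast" release and v6.4 the first "slow" one:

# In a kernel git tree: mark the boundaries, then build and benchmark each suggested commit
git bisect start
git bisect bad v6.4
git bisect good v6.3
# ...after testing each candidate kernel, mark it accordingly:
git bisect good    # or: git bisect bad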

It turns out that a vulnerability called Speculative Return Stack Overflow (SRSO) was found in AMD processors, and the patch for it severely affects I/O performance.

Since I always keep the processor microcode up to date thanks to the official Talos extensions amd-ucode and intel-ucode (building custom images on factory.talos.dev), I can afford to disable the software mitigation for this vulnerability with the kernel argument spec_rstack_overflow=microcode.
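Which SRSO mitigation is active can be checked via sysfs on kernels that carry the patch (the exact wording of the output differs between kernel versions):

# Reports the current SRSO mitigation status for the CPU
cat /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow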

It would also be possible to use mitigations=off, but I do not recommend doing that; at least in my tests it gave no performance gain over spec_rstack_overflow=microcode.

One question remains: why is the security patch present in 6.1 but absent from 6.2 and 6.3? Simple: 6.1 is a Long Term Support (LTS) release and still receives backported fixes, while 6.2 and 6.3 are End Of Life (EOL) and no longer do.

The influence of each parameter separately

I decided to check how much each kernel argument affects I/O performance on its own; here is what I got:

Among other things, I highly recommend using the performance CPU governor on any bare-metal installation (kernel argument cpufreq.default_governor=performance). Also interesting: the new amd_pstate scaling driver had no effect on I/O. And mitigations=off gives no gain over disabling just the SRSO patch.
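On the Debian test machine, toggling a single argument between runs is just a GRUB edit and a reboot; a sketch, run as root, with one of the tested arguments shown:

# Prepend one kernel argument to GRUB_CMDLINE_LINUX_DEFAULT, regenerate the config and reboot
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&spec_rstack_overflow=microcode /' /etc/default/grub
update-grub
reboot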

Bottom line

Thanks to such primitive tuning, I was able to bring I/O performance on Talos Linux back to the level of vanilla Debian, however ironic that may sound. Well, almost:

For all other tests there is parity, but in Rand 4K T1Q128 Talos Linux lags by 70K IOPS (14%).

The Talos Linux kernel is configured according to the KSPP (Kernel Self-Protection Project) guidelines, so Page Table Isolation (PTI) is enabled. Talos Linux does not let you disable it (and rightly so), but on Debian I decided to enable it and check PTI's effect on I/O: performance dropped from 520K IOPS to 480K.
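PTI can be forced on or off with the pti=on / pti=off kernel argument; whether it is active can be checked like this (a sketch; the dmesg wording may vary slightly across kernel versions):

# The "pti" flag appears in /proc/cpuinfo when page table isolation is active
grep -wo pti /proc/cpuinfo | sort -u

# Boot messages also state whether it was forced on or off
dmesg | grep -i "page tables isolation"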

Here is the final set of kernel arguments that can “unlock” the IOPS:

machine:
  install:
    extraKernelArgs:
      - cpufreq.default_governor=performance
      - amd_pstate=active
      - iommu=off
      - spec_rstack_overflow=microcode

I don't use virtualization, so I decided to disable the IOMMU entirely (iommu=off), which gave another small gain.
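Rolling this out to an existing node might look roughly like the following (node address, patch file name and installer image are placeholders; since these are install-time kernel arguments, they take effect after the next upgrade/reinstall of the node):

# Apply the machine config patch containing the extraKernelArgs above
talosctl patch machineconfig --nodes 10.0.0.2 --patch @kernel-args-patch.yaml

# Re-run the installer so the new kernel arguments end up in the boot configuration
talosctl upgrade --nodes 10.0.0.2 --image factory.talos.dev/installer/<schematic-id>:<version>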

I have no doubt that there are still a lot of ways to tweak Linux, but this is beyond my current knowledge, and overall I’m pleased with the result.

Bonus

For storage in Kubernetes, I decided to use one of the highest-performing solutions available at the moment – Linstor. There is a convenient Kubernetes operator for it – the Piraeus Operator.

I created two StorageClasses: nvme-lvm-local and nvme-lvm-replicated-async. The first mounts a thick LVM volume directly into the pod without DRBD replication: many modern DBMSs can replicate data themselves, and letting them do so is the more efficient approach. The second uses asynchronous DRBD replication to another server; this approach is most often used for applications that cannot replicate their data on their own.

In the DRBD case, pods always work with their data locally thanks to volumeBindingMode: WaitForFirstConsumer and the parameter linstor.csi.linbit.com/allowRemoteVolumeAccess: "false", which made it possible to squeeze out maximum performance despite the replication.

The results were as follows:

A Google spreadsheet with the details is available here.

Thank you for your attention!
