Linux Pipes are slow

Some programs use the vmsplice system call to move data through a pipe faster. Francesco has already done a detailed analysis of using vmsplice to speed things up. However, while experimenting, I noticed that without vmsplice, pipes in Linux are slower than I expected. Since vmsplice can't always be used, I wanted to understand why this happens and whether pipes can be sped up.

I am writing a program for ultra-fast Morse code encoding/decoding, and I use a pipe to transfer the data.

The first thing that comes to mind is the Fizz Buzz throughput competition on Code Golf Stack Exchange. There are two types of solutions:

  1. the first ones reach speeds of up to several gigabytes per second; for example, Neil's solution reaches 8.4 GiB/s;

  2. the second ones significantly exceed the results of the first, starting with Timo Kluck's solution at 15.5 GiB/s and ending with ais523's solution at 60.8 GiB/s and David Frank's at 208.3 GiB/s when using multiple cores.

The difference between the first and second groups is that the second uses vmsplice and the first does not. But how can vmsplice provide such a significant performance boost? My gut feeling is that vmsplice avoids copying data to and from kernel space. After all, copying data can't be slower than generating it, right? Even if we assume copying is no faster than generating, and that the data must be copied twice to pass through the pipe (once into the kernel and once out of it, on top of generating it), we could expect a speedup of at most 3 times. But in reality we see a 7-fold increase, even when comparing solutions that use a single core.

It feels like I'm missing something, and I want to figure out what it is.

I'll do my own measurements first, to make it easier to compare with what I'll do next. Compiling and running ais523's solution on my computer, I get the following results:

$ ./fizzbuzz | pv >/dev/null
96.4GiB 0:00:01 [96.4GiB/s]

With David's solution the results reach 277 GB/s using 7 cores (40 GB/s per core).

Now, to understand what is happening, we need to answer the following questions:

  1. How fast can we write data under ideal conditions?

  2. How fast can we actually write data to a pipe?

  3. How does vmsplice help?

Writing data in an ideal world

First, let's look at the following program, which simply copies data without making any system calls. I use std::hint::black_box to prevent the compiler from noticing that the result is never used. Otherwise, the compiler would optimize the program away to nothing.

fn main() {
    // Two 32 KiB buffers, small enough to fit in the L1 cache.
    let dst = [0u8; 1 << 15];
    let src = [0u8; 1 << 15];
    let mut copied = 0;
    // Copy 1000 GiB in total.
    while copied < (1000 << 30) {
        // black_box prevents the compiler from optimizing the copy away.
        std::hint::black_box(dst).copy_from_slice(&src);
        copied += src.len();
    }
}
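To turn this loop into a throughput figure, the elapsed time can be measured from inside the program. The following is only a sketch of such a harness (the article does not show how its numbers were obtained), assuming we simply print GB/s at the end:

use std::time::Instant;

fn main() {
    let mut dst = [0u8; 1 << 15];
    let src = [0u8; 1 << 15];
    let total: u64 = 1000 << 30; // copy 1000 GiB in total
    let mut copied: u64 = 0;
    let start = Instant::now();
    while copied < total {
        // black_box again keeps the copy from being optimized away.
        std::hint::black_box(&mut dst[..]).copy_from_slice(&src);
        copied += src.len() as u64;
    }
    let secs = start.elapsed().as_secs_f64();
    println!("{:.1} GB/s", copied as f64 / secs / 1e9);
}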

On my system it runs at 167 GB/s, which is about the L1 cache write speed for my CPU.

Profiling with ftrace shows that 99.9% of the time is spent in the function __memset_avx512_unaligned_erms, which is called directly from main and does not call other functions. The flame graph is almost flat. If you don't want to use a full-fledged profiler, you can simply run the program under gdb and press Ctrl+C at a random moment:

$ cargo build --release
$ gdb target/release/copy 
…
(gdb) run
…
^C (hitting Ctrl+C)
Program received signal SIGINT, Interrupt.
__memset_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:236
…
=> 0x00007ffff7f15dba    f3 aa    rep stos %al,%es:(%rdi)

In any case, note that AVX-512 is used. Seeing memset in the name may be unexpected; this is because part of the logic is shared with memcpy. The implementation lives in a generic file dedicated to SIMD vectorization, which supports SSE, AVX2 and AVX-512. In our case, the AVX-512 specialization is used.

Note that on Mach-based systems (mostly Apple products), the glibc implementation of memcpy uses the kernel function vm_copy to copy pages directly.

However, AVX-512 is a rather niche technology. According to Steam's hardware survey, only about 12% of Steam users have CPUs that support AVX-512. Intel added AVX-512 support to consumer CPUs only in the 11th generation, and now reserves it for servers. AMD CPUs have supported AVX-512 since the Ryzen 7000 (Zen 4) series.

So I tested the same program with AVX-512 disabled, using the Linux kernel boot option clearcpuid=304. Using the gdb-and-Ctrl+C trick, I verified that the function __memset_avx2_unaligned_erms was now used. Then I did the same to disable AVX2 with clearcpuid=304,avx2,avx, which made glibc fall back to __memset_sse2_unaligned_erms.

Although SSE2 is always available on x86-64, I also disabled the cpuid bits for SSE2 and SSE to see whether that would make glibc fall back to scalar registers for copying. The result was an immediate kernel panic. Alas.

Using AVX2, the throughput was… 167 GB/s. Using only SSE2, the throughput remained… the same, 167 GB/s. To some extent this makes sense: even SSE2 is enough to saturate the L1 cache write bandwidth. Wider registers only help when performing ALU operations.

The conclusion from this experiment is that as long as vectorization is used, the result should reach 167 GB/s.

Writing data to a pipe

Well, let's see what happens when writing to a pipe instead of user space memory:

use std::io::Write;
use std::os::fd::FromRawFd;
fn main() {
    // A 32 KiB buffer of zero bytes to write repeatedly.
    let vec = vec![b'\0'; 1 << 15];
    let mut total_written = 0;
    // Wrap fd 1 directly to avoid the locking and buffering of std::io::stdout().
    let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
    // Stop after writing 100 GiB to the pipe.
    while let Ok(n) = stdout.write(&vec) {
        total_written += n;
        if total_written >= (100 << 30) {
            break;
        }
    }
}

To measure throughput we will use:

cargo run --release | pv >/dev/null

On my device, the result reaches 17 GB/s. That's 10 times slower than writing to a buffer! How can a system call that essentially writes to a kernel buffer be so slow? And no, context switching doesn't take that long.

It's time to start profiling this program.

The flame graph of this program is interactive in the original article.

Note that __GI___libc_write is the glibc wrapper that performs the system call. Everything it calls runs in the kernel; everything up to and including the wrapper runs in user space.

As expected, most of the time is taken by the write call. In particular, 95% of the time is spent in pipe_write. Within that function, 36% of the total time goes to __alloc_pages, which provides new memory pages for the pipe. We can't just reuse the same pages, because pv moves them to /dev/null using splice, consuming them in the process.
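For reference, a splice-based consumer like pv boils down to something like the sketch below (an illustration only, assuming the libc crate and a Linux target; pv itself is written in C and does much more):

use std::os::fd::AsRawFd;

fn main() {
    // The sink we splice the pipe's pages into.
    let devnull = std::fs::OpenOptions::new()
        .write(true)
        .open("/dev/null")
        .unwrap();
    loop {
        // splice(2) moves up to 64 KiB from fd 0 (the read end of the pipe)
        // to /dev/null without copying the data through user space.
        let n = unsafe {
            libc::splice(
                0,                    // stdin
                std::ptr::null_mut(), // a pipe has no offset
                devnull.as_raw_fd(),
                std::ptr::null_mut(),
                1 << 16,
                0,
            )
        };
        if n <= 0 {
            break; // EOF or error
        }
    }
}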

Next come __mutex_lock.constprop.0, which takes up 25% of the time, and _raw_spin_lock_irq, which takes 5%. Both are taken to lock the pipe for writing.

It turns out that copying the data in copy_user_enhanced_fast_string accounts for only 20% of the time. But even with only 20% of the CPU time, we could expect a throughput of 167 GB/s × 20% = 33 GB/s. Since we only measure 17 GB/s, this function by itself must be about 2 times slower than the __memset_avx512_unaligned_erms used in the program that wrote to user-space memory.

But what makes copy_user_enhanced_fast_string so slow? We need to dig deeper. It's time to disassemble my Linux kernel and look at how this function is implemented.

$ grep -w copy_user_enhanced_fast_string /usr/lib/debug/boot/System.map-6.1.0-18-amd64 
ffffffff819d3d90 T copy_user_enhanced_fast_string
$ objdump -d --start-address=0xffffffff819d3d90 vmlinuz | less   
    
vmlinuz:     file format elf64-x86-64


Disassembly of section .text:

ffffffff819d3d90 <.text+0x9d3d90>:

ffffffff819d3d90:       90                      nop
ffffffff819d3d91:       90                      nop
ffffffff819d3d92:       90                      nop
ffffffff819d3d93:       83 fa 40                cmp    $0x40,%edx
ffffffff819d3d96:       72 48                   jb     0xffffffff819d3de0
ffffffff819d3d98:       89 d1                   mov    %edx,%ecx
ffffffff819d3d9a:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
ffffffff819d3d9c:       31 c0                   xor    %eax,%eax
ffffffff819d3d9e:       90                      nop
ffffffff819d3d9f:       90                      nop
ffffffff819d3da0:       90                      nop
ffffffff819d3da1:       e9 9a dd 42 00          jmp    0xffffffff81e01b40
...
ffffffff81e01b40:       c3                      ret

The NOP instructions at the beginning and end of the function allow ftrace to insert tracing instructions when needed. This makes it possible to collect performance data on specific kernel functions without slowing down the others. The processor's decode pipeline handles the NOPs ahead of time, so their impact on performance should be minimal (aside from the L1i cache they occupy).

What I don't understand is why a JMP is used rather than just a RET.

In any case, the CMP check and the JB jump handle buffers smaller than 64 bytes, branching to another routine that copies 8 bytes at a time through 64-bit registers and then 1 byte at a time through an 8-bit register, in two loops. Larger buffers are copied with the REP MOVSB instruction. This code is clearly not vectorized.

In fact, this function is not implemented in C but directly in assembly! This means we don't need to look at compiler output; we can go straight to the source code. It also means this is not a missed optimization at the compilation stage: it was written this way on purpose.
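To make the structure easier to follow, here is a rough Rust rendering of what the assembly does. This is only a sketch: the function name copy_enhanced is mine, and the real routine also has to handle page faults on user addresses, which is ignored here.

use std::arch::asm;

// Rough equivalent of the kernel routine's structure; illustration only.
fn copy_enhanced(dst: &mut [u8], src: &[u8]) {
    let mut len = src.len().min(dst.len());
    let mut d = dst.as_mut_ptr();
    let mut s = src.as_ptr();
    if len >= 64 {
        // Large buffers: a single REP MOVSB, as in the kernel routine.
        unsafe {
            asm!(
                "rep movsb",
                inout("rsi") s => _,
                inout("rdi") d => _,
                inout("rcx") len => _,
            );
        }
        return;
    }
    // Small buffers: 8 bytes at a time through a 64-bit register,
    // then 1 byte at a time, as in the routine the JB jump leads to.
    unsafe {
        while len >= 8 {
            (d as *mut u64).write_unaligned((s as *const u64).read_unaligned());
            d = d.add(8);
            s = s.add(8);
            len -= 8;
        }
        while len > 0 {
            *d = *s;
            d = d.add(1);
            s = s.add(1);
            len -= 1;
        }
    }
}

fn main() {
    let src = [7u8; 100];
    let mut dst = [0u8; 100];
    copy_enhanced(&mut dst, &src);
    assert_eq!(&dst[..], &src[..]);
}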

But is the lack of vector instructions the only reason that copy_user_enhanced_fast_string is 2 times slower than __memset_avx512_unaligned_erms? To test this, I adapted the original Rust program to use REP MOVSB:

use std::arch::asm;

fn main() {
    let src = [0u8; 1 << 15];
    let mut dst = [0u8; 1 << 15];
    let mut copied = 0;
    // Copy 1000 GiB in total, 32 KiB at a time, using only REP MOVSB,
    // just like the kernel's copy_user_enhanced_fast_string.
    while copied < (1000u64 << 30) {
        unsafe {
            asm!(
                "rep movsb",
                inout("rsi") src.as_ptr() => _,     // source
                inout("rdi") dst.as_mut_ptr() => _, // destination
                inout("ecx") 1 << 15 => _,          // byte count
            );
        }
        copied += 1 << 15;
    }
}

The throughput is 80 GB/s. This is the same 2x slowdown that we observed in the kernel function!

Now we know that the Linux kernel does not use SIMD for copying memory, which makes copy_user_enhanced_fast_string 2 times slower than it could be.

But why? On Stack Overflow, Peter Cordes explains that using SSE/AVX instructions in the kernel is not worth it in most cases because of the cost of saving and restoring the SIMD context.

To summarize: the kernel spends a lot of time on memory management and does not even use SIMD when copying bytes. This is the root cause of the 10x slowdown compared to the ideal example.

vmsplice to the rescue

Now we have an upper bound (167 GB/s for writing to memory once) and a lower bound (17 GB/s when using write on a pipe). Let's take a closer look at what vmsplice does. It reduces the cost of using pipes by handing buffers from user space to the kernel without copying them.
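To make the call concrete, here is a minimal sketch of a vmsplice-based writer (an illustration only, assuming the libc crate; this is not Francesco's ./write program, and real code also has to alternate between buffers so that a page is not overwritten while the pipe still references it):

fn main() {
    // A 64 KiB buffer full of 'X' bytes.
    let buf = vec![b'X'; 1 << 16];
    let mut total: u64 = 0;
    // Push 100 GiB through the pipe, one vmsplice call at a time.
    while total < (100 << 30) {
        let iov = libc::iovec {
            iov_base: buf.as_ptr() as *mut libc::c_void,
            iov_len: buf.len(),
        };
        // vmsplice hands references to the pages backing `buf` to the pipe
        // instead of copying their contents into a kernel buffer.
        let n = unsafe { libc::vmsplice(1, &iov, 1, 0) };
        if n <= 0 {
            break; // stdout is not a pipe, the pipe was closed, or an error occurred
        }
        total += n as u64;
    }
}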

To understand how it works, read Francesco's great article. We will use the ./write program from that article as a minimal example of using vmsplice. This program writes an endless stream of 'X' bytes, which makes profiling easier, since no time is wasted on computing Fizz Buzz or anything else.

In practice, ./write reaches 210 GB/s, which is significantly higher than our upper bound, but in this case the program cheats a little by handing the same buffers to vmsplice over and over. For anything other than a constant stream of bytes, we would need to fill the buffers with new data, which is where we would hit our upper bound. However, here we are only interested in what vmsplice itself does:

The flame graph of this program is interactive in the original article.

As with write, we spend a significant amount of time (37%) in __mutex_lock.constprop.0. But now there is no __alloc_pages and no _raw_spin_lock_irq. And instead of copy_user_enhanced_fast_string we see add_to_pipe, import_iovec and iov_iter_get_pages2. This shows how vmsplice bypasses the expensive parts of the write system call.

I was a bit surprised by the impact of buffer size, especially when vmsplice is not used. It seems that minimizing the number of system calls is not always the right goal.

To sum it up

That's it. Writing to a pipe is ten times slower than writing directly to memory. This is because when writing to a pipe, we have to spend a lot of time on locks, and we can't use vector instructions efficiently.

In principle, we could move data at 167 GB/s, but to do so we need to avoid the cost of the pipe locks and of saving and restoring the SIMD context. That's what splice and vmsplice do. They are often described as a way to avoid copying data between buffers, and that is true, but most importantly, they completely bypass the conservative kernel code path with its locking and scalar copies.
