Incredibly fast byte counting

It turned out that my post on summing ASCII-encoded integers on Haswell at the speed of memcpy was much more popular than I expected. So I decided to take part in another challenge in the HighLoad series: count uint8. At the moment I am only in 13th place on the leaderboard, trailing first place by about 7%, but I have already learned a lot of interesting things along the way. In this post I will describe my solution in full, including a surprising pattern for reading from memory. With it, you can achieve roughly 30% higher transfer speeds than conventional sequential access in single-core workloads whose working set does not fit in cache. This method appears to be little known.

As in my other posts, the program is tuned for the HighLoad system's configuration: Intel Xeon E3-1271 v3 @ 3.60 GHz, 512 MB RAM, Ubuntu 20.04. Only AVX2 is available, not AVX-512.

Task

“Print the number of bytes whose value equals 127 in a 250 MB file of bytes uniformly sampled from [0, 255] that is sent to standard input.”

It could hardly be stated more simply! The solution presented here is about 550 times faster than the following trivial program.

#include <cstdint>
#include <iostream>

int main() {
    uint64_t count = 0;
    // note: operator>> skips whitespace bytes, but 127 is not whitespace,
    // so the count is still correct - just very slow
    for (uint8_t v; std::cin >> v;) {
        if (v == 127) {
            ++count;
        }
    }
    std::cout << count << std::endl;
    return 0;
}

Core

The entire source code of the solution is given at the end of this post, but first I'll explain step by step how it works. The core consists of only three instructions, so I'll jump straight into the __asm__ block (sorry!).

; rax is the base address of the input
; rsi is the offset of the current chunk
vmovntdqa    (%rax, %rsi, 1), %ymm4
; ymm2 is a vector filled with 127
vpcmpeqb     %ymm4, %ymm2, %ymm4
; ymm6 is an accumulator whose bytes hold the running count
; of 127s seen at each position within the input chunks
vpsubb       %ymm4, %ymm6, %ymm6

With this code we iterate over the input in 32-byte chunks and:

  • We load a chunk with vmovntdqa (a non-temporal load, i.e. one that hints the data should bypass the cache; it's included purely for style points and plays no role in performance here).

  • We compare each byte of the chunk with 127 using vpcmpeqb, which gives 0xFF (aka -1) wherever a byte equals 127 and 0x00 everywhere else. For example, [125, 126, 127, 128, …] becomes [0, 0, -1, 0, …].

  • We subtract the comparison result from the accumulator. Continuing the previous example and assuming the accumulator starts out all zeros, we get [0, 0, 1, 0, …].
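
For reference, here are the same three steps expressed with intrinsics (a minimal sketch with illustrative variable names, not the actual kernel):

// acc is the running per-byte accumulator (ymm6 above)
__m256i needle = _mm256_set1_epi8(127);                                      // ymm2
__m256i chunk  = _mm256_stream_load_si256((const __m256i*)(input + offset)); // vmovntdqa (needs 32-byte alignment)
__m256i eq     = _mm256_cmpeq_epi8(chunk, needle);                           // vpcmpeqb
acc            = _mm256_sub_epi8(acc, eq);                                   // vpsubb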

Then, to prevent this narrow accumulator from overflowing, we dump it into a wider one from time to time using the following code:

; ymm1 is a zero vector
; ymm6 is the narrow accumulator
vpsadbw      %ymm1,%ymm6,%ymm6
; ymm3 is the wide accumulator
vpaddq       %ymm3,%ymm6,%ymm3

vpsadbw sums each group of eight bytes in the accumulator, producing four 64-bit numbers, and vpaddq then adds the result to a wider accumulator that is guaranteed not to overflow. At the end of the run, we extract and sum its four lanes to get the final count.
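
Putting the core and the flush together, here is a complete plain-intrinsics reference function (a sketch of the approach, not the tuned solution; it flushes every 255 chunks so the byte lanes cannot overflow even on adversarial input, whereas the tuned solution below flushes once per 32 KB batch, relying on matches being rare in uniformly random input):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

static uint64_t count127(const uint8_t* data, size_t len) {
    const __m256i needle = _mm256_set1_epi8(127);
    __m256i wide = _mm256_setzero_si256();           // four 64-bit partial sums
    size_t off = 0;
    while (off + 32 <= len) {
        __m256i acc = _mm256_setzero_si256();        // per-byte match counts
        // at most 255 chunks per flush, so a byte lane cannot overflow
        for (int i = 0; i < 255 && off + 32 <= len; ++i, off += 32) {
            __m256i chunk = _mm256_loadu_si256((const __m256i*)(data + off));
            __m256i eq = _mm256_cmpeq_epi8(chunk, needle);  // -1 where byte == 127
            acc = _mm256_sub_epi8(acc, eq);                 // subtracting -1 adds 1
        }
        // vpsadbw sums each group of 8 bytes into a 64-bit lane; vpaddq accumulates
        wide = _mm256_add_epi64(wide, _mm256_sad_epu8(acc, _mm256_setzero_si256()));
    }
    alignas(32) uint64_t lanes[4];
    _mm256_store_si256((__m256i*)lanes, wide);
    uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; off < len; ++off)                         // scalar tail
        total += (data[off] == 127);
    return total;
}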

Nothing extraordinary so far. In fact, this is exactly the approach described in this StackOverflow discussion: How to count character occurrences using SIMD.

The magic begins

The difficulty of this task is that it involves very little computation and is instead heavily memory-bound. I dug through the Intel optimization manual (it's full of typos) in search of the memory details I needed, until on page 788 I came across a description of its 4 hardware prefetchers (data prefetch mechanisms). Three of them seemed useful only for sequential access (which I was already doing), but one, called the "Streamer," had an interesting twist:

“Detects and maintains up to 32 streams of data accesses. For each 4K byte page, you can maintain one forward and one backward stream.”

“For each 4K byte page.” Do you see the point? Instead of processing the entire input sequentially, we can interleave the processing of consecutive 4 KB pages. We also unroll the core a little and process an entire cache line (2×32 bytes) in each block.

#define BLOCK(offset) \
    "vmovntdqa    " #offset " * 4096 (%6, %2, 1), %4\n\t" \
    "vpcmpeqb     %4, %7, %4\n\t" \
    "vmovntdqa    " #offset " * 4096 + 0x20 (%6, %2, 1), %3\n\t" \
    "vpcmpeqb     %3, %7, %3\n\t" \
    "vpsubb       %4, %0, %0\n\t" \
    "vpsubb       %3, %1, %1\n\t" \

We place 8 of these blocks in the main loop, with offset taking each value from 0 to 7 inclusive.
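
In plain C++, the resulting traversal order looks roughly like this (an illustrative sketch; process_line is a hypothetical stand-in for the six-instruction block above):

// process cache line 0 of pages 0..7, then line 1 of pages 0..7, and so on
for (size_t base = 0; base + 8 * 4096 <= length; base += 8 * 4096) {
    for (size_t line = 0; line < 4096; line += 64) {
        for (size_t page = 0; page < 8; ++page) {
            process_line(input + base + page * 4096 + line); // one 64-byte line
        }
    }
}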

This raises the HighLoad score by about 15%, and if your kernel is even more memory-bound (say, you just add the bytes with vpaddb, computing their sum modulo 256), you can gain up to 30%. Impressive, considering how simple a change it is!
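
In intrinsics, that hypothetical summing kernel's inner step would be just something like:

chunk = _mm256_stream_load_si256((const __m256i*)(input + offset)); // vmovntdqa
acc   = _mm256_add_epi8(acc, chunk);                                // vpaddb: wraps modulo 256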

Lastly, there is one more small detail: we add a prefetch 4 cache lines ahead:

#define BLOCK(offset) \
    "vmovntdqa    " #offset " * 4096 (%6, %2, 1), %4\n\t" \
    "vpcmpeqb     %4, %7, %4\n\t" \
    "vmovntdqa    " #offset " * 4096 + 0x20 (%6, %2, 1), %3\n\t" \
    "vpcmpeqb     %3, %7, %3\n\t" \
    "vpsubb       %4, %0, %0\n\t" \
    "vpsubb       %3, %1, %1\n\t" \
    "prefetcht0   " #offset " * 4096 + 4 * 64 (%6, %2, 1)\n\t"

Why exactly 4 cache lines ahead? I can't give a definitive answer; it simply works best this way. The graph below shows runs of this solution with prefetch distances from 0 to 100 cache lines, measured on a different system (which is why the optimum is shifted). As you can see, the curve is quite complex.
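
In intrinsics terms, the added instruction corresponds to _mm_prefetch with the T0 hint (a sketch, reusing the names from the traversal example above):

// prefetch the cache line 4 lines ahead of the one being processed
_mm_prefetch((const char*)(input + base + page * 4096 + line) + 4 * 64, _MM_HINT_T0);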

Source code

#include <iostream>
#include <cstdint>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <immintrin.h>
#include <cassert>

#define BLOCK_COUNT 8
#define PAGE_SIZE 4096
#define TARGET_BYTE 127

#define BLOCKS_8 \
    BLOCK(0)  BLOCK(1)  BLOCK(2)  BLOCK(3) \
    BLOCK(4)  BLOCK(5)  BLOCK(6)  BLOCK(7)

#define BLOCK(offset) \
    "vmovntdqa    " #offset "*4096(%6,%2,1),%4\n\t" \
    "vpcmpeqb     %4,%7,%4\n\t" \
    "vmovntdqa    " #offset "*4096+0x20(%6,%2,1),%3\n\t" \
    "vpcmpeqb     %3,%7,%3\n\t" \
    "vpsubb       %4,%0,%0\n\t" \
    "vpsubb       %3,%1,%1\n\t" \
    "prefetcht0   " #offset "*4096+4*64(%6,%2,1)\n\t"


static inline
__m256i hsum_epu8_epu64(__m256i v) {
    return _mm256_sad_epu8(v, _mm256_setzero_si256());
}

int main() {
    struct stat sb;
    assert(fstat(STDIN_FILENO, &sb) != -1);
    size_t length = sb.st_size;

    // map stdin (a regular file in the challenge) read-only, prefaulted
    char* start = static_cast<char*>(mmap(nullptr, length, PROT_READ, MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0));
    assert(start != MAP_FAILED);

    uint64_t count = 0;
    __m256i sum64 = _mm256_setzero_si256();
    size_t offset = 0;

    __m256i compare_value = _mm256_set1_epi8(TARGET_BYTE);
    __m256i acc1 = _mm256_set1_epi8(0);
    __m256i acc2 = _mm256_set1_epi8(0);
    __m256i temp1, temp2;

    while (offset + BLOCK_COUNT*PAGE_SIZE <= length) {
        // walk the 64 cache lines of 8 consecutive pages in lockstep
        int batch = PAGE_SIZE / 64;
        asm volatile(
            ".align 16\n\t"
            "0:\n\t"

            BLOCKS_8

            "add          $0x40, %2\n\t"
            "dec          %5\n\t"
            "jg           0b"
            : "+x" (acc1), "+x" (acc2), "+r" (offset), "+x" (temp1), "+x" (temp2), "+r" (batch)
            : "r" (start), "x" (compare_value)
            : "cc", "memory"
        );

        // the asm advanced offset by one full page; skip the other 7
        offset += (BLOCK_COUNT - 1)*PAGE_SIZE;

        // flush the byte accumulators into 64-bit lanes once per 32 KB batch
        sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(acc1));
        sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(acc2));

        acc1 = _mm256_set1_epi8(0);
        acc2 = _mm256_set1_epi8(0);
    }

    sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(acc1));
    sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(acc2));

    count += _mm256_extract_epi64(sum64, 0);
    count += _mm256_extract_epi64(sum64, 1);
    count += _mm256_extract_epi64(sum64, 2);
    count += _mm256_extract_epi64(sum64, 3);

    // scalar tail for the remaining bytes
    for (; offset < length; ++offset) {
        if (start[offset] == TARGET_BYTE) {
            ++count;
        }
    }

    std::cout << count << std::endl;
    return 0;
}
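
For reference, a plausible way to build and run it (the flags are my assumption, not part of the challenge setup):

g++ -O2 -mavx2 -o count count.cpp
./count < input.bin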

Conclusion

It's surprising how overlooked the interleaved-page pattern is. As far as I can remember, I have never seen it used in real-world code. Curious! If you have ever encountered it, please tell me about it. And if I've forgotten some other memory optimization tricks, please let me know about those too.
