how Clickhouse removes its own code from memory and switches to using Huge Pages

There is an interesting code in Clickhouse: when one function is called, the memory area of ​​the program’s executable code is transferred to the use of Huge Pages. In the process, all the program code is copied to a new location, the memory that was originally used for the program code is returned to the OS, and then requested again. This article is based on the relevant part report with Y. Subbotnik.

First, we will see what virtual memory and TLB are, then we will move on to Clickhouse itself, we will see why the idea came up to do such manipulations with a binary in memory, and at the end we will see how it is all implemented.

Virtual Memory

As you know, programs on modern x86 processors do not work directly with physical memory – instead, a virtual memory mechanism is used. In this case, all memory addresses that are available to the program are virtual. Each time the memory is accessed, the processor converts the virtual address into a physical one. In order for this to work, the OS sets up the page table system. Let’s see how it happens on a 32-bit system. All memory is divided into pages. As a rule, the page has a size of 4Kb. On x86, an address in memory is 32 bits. To convert a virtual address to a physical address, the processor takes the top 10 bits of the address and reads the entry with that number from the page table. There will be the address of the second table, in which the entry is searched for under the number corresponding to the second 10 bits of the original address. Now the processor receives the address of the beginning of the desired memory page, adds the lower 12 bits of the address to it and receives a physical address where data can already be read or written.

Of course, as described, everything would work for a very long time: we wanted to go through one address from memory, and to do this, we had to read the page tables from memory twice more .. Therefore, TLB (Translation lookaside buffer) is used – this is a kind of cache .

On X86/64 (amd64) the memory handling scheme is similar, but since the address is already 64-bit instead of 32-bit, the table levels become 4 or 5 instead of two.

Large Pages

Because the TLB has a limited and not very large size, misses happen and it takes a lot of time. The developers of the Linux kernel have proposed a solution to this problem – Huge pages. These are 2MB or 1GB pages. The logic is this: there are more pages – fewer pages in total – less space in the TLB is needed and misses happen less often. Everything should work faster.

Clickhouse Performance Tests

In order to be able to make good optimizations for Clickhouse, there are performance tests there – many performance tests are launched with each commit. Since Clickhouse is a DBMS, the tests are different queries. As a result, for each request, there can be 4 verdicts:

  1. Difference not significant – no statistically significant difference

  2. Request speeded up

  3. Request slowed down

  4. The query is unstable – the variation in the query execution time is so large that it is impossible to draw a conclusion.

The last option is not very good, because it does not provide useful information about the performance impact of the commit. In addition, some unstable queries may not be defined as unstable, but a random result from the first three.

Why can requests be unstable? In order to understand, you can try to look at the metrics:

  • userspace metrics are those that Clickhouse collects itself. How long the query was running, how many rows were processed, and so on

  • OS metrics: how long the program waited in line to start, etc.

  • processor metrics: how many instructions were processed, the number of L1 cache and TLB misses

Strange Pull Request

At some point, a pull request appeared, in which type support was added Int256, Uint256, Decimal256 for counting money in cryptocurrencies. As a result, the size of the binary has grown (from 324 MB to 429 MB). Performance test slowdowns were not detected, but the number of unstable tests increased.

As a result of studying the metrics, it was found that the only metric that could be associated with an increase in the number of unstable requests is the number of TLB misses for program instructions, that is, when the processor reads instructions for execution. In other words, due to the fact that the size of the binary has grown a lot, the number of TLB misses for addresses in the code has increased, and this could lead to an increase in the number of unstable requests.

The solution is to shove the memory segment with the program code into memory using Huge Pages!

Implementation

The code that does all the magic is here.

So, we need to somehow force the OS to use large pages for memory with program instructions.
It would seem that you can make a system call madvise and tell the operating system to use large pages for the right area of ​​memory.

But it was not there. Linux will only allow large pages to be used for anonymous memory maps – that is, those that are not associated with any file. Since instruction memory is a memory mapping of a binary file, it is certainly not an anonymous mapping.

What then to do? There is a crazy solution – you need to re-allocate the code on the fly:

  1. Allocate a new piece of memory and ask the OS to use large pages for it

  2. Copy all program code there

  3. Make this code executable

  4. Send execution there

  5. Remove the mapping of the old code memory by making a munmap.

Will it work? Of course not. All function calls, conditional and unconditional jumps will try to switch to executing instructions from the old chunk of memory, for which large pages are not enabled, since Position Independent Code was not enabled during the Clickhouse build for performance reasons. With such a transition, a Segmentation Fault will occur, because the old memory is no longer there or something else is there.

But you can do it a little differently.

  1. Allocate a new piece of memory

  2. Copy all program code there

  3. Make this code executable

  4. Go there

  5. Remove mapping of old code area by calling munmap

  6. Allocate a piece of memory at the old address, the same size as it was

  7. By using madvise ask the OS to use large pages for it

  8. Rewrite the code to this memory area, mark it as executable and jump to it

  9. Victory!

We’ll see how to do it.

Function getMappedArea looks through the list of mappings into the memory of this process and finds the one in which the pointer passed to it is located – if you pass the address of some function there, it will return the mapping where the program code is located.

getMappedArea
std::pair<void *, size_t> getMappedArea(void * ptr)
{
    using namespace DB;

    uintptr_t uintptr = reinterpret_cast<uintptr_t>(ptr);
    ReadBufferFromFile in("/proc/self/maps");

    while (!in.eof())
    {
        uintptr_t begin = readAddressHex(in);
        assertChar('-', in);
        uintptr_t end = readAddressHex(in);
        skipToNextLineOrEOF(in);

        if (begin <= uintptr && uintptr < end)
            return {reinterpret_cast<void *>(begin), end - begin};
    }

    throw Exception("Cannot find mapped area for pointer", ErrorCodes::LOGICAL_ERROR);
}

Function remapExecutable calls a function getMappedAreaafter which it goes to the first step of our process in the function remapToHugeStep1.

remapExecutable
size_t remapExecutable()
{
    auto [begin, size] = getMappedArea(reinterpret_cast<void *>(remapExecutable));
    remapToHugeStep1(begin, size);
    return size;
}

Function remapToHugeStep1 allocates a new piece of memory using mmap, copies the entire program code there with the function memcpyafter which it goes to the function remapToHugeStep2but not just like that, but by switching to the code in a new area of ​​​​memory.

remapToHugeStep1
__attribute__((__noinline__)) void remapToHugeStep1(void * begin, size_t size)
{
    /// Allocate scratch area and copy the code there.

    void * scratch = mmap(nullptr, size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == scratch)
        throwFromErrno(fmt::format("Cannot mmap {} bytes", size), ErrorCodes::CANNOT_ALLOCATE_MEMORY);

    memcpy(scratch, begin, size);

    /// Offset to the scratch area from previous location.

    int64_t offset = reinterpret_cast<intptr_t>(scratch) - reinterpret_cast<intptr_t>(begin);

    /// Jump to the next function inside the scratch area.

    reinterpret_cast<void(*)(void*, size_t, void*)>(reinterpret_cast<intptr_t>(remapToHugeStep2) + offset)(begin, size, scratch);
}

At the time of entering the function remapToHugeStep2 register instruction pointer (eip) is located in a new area of ​​memory, so the old one can be deleted. Unfortunately, you cannot use library functions, so their replacements are used.

remapToHugeStep2 removes the mapping of the binary to the old code memory area using munmap, then makes an anonymous mapping of the same size in the same place and asks the OS to use large pages for it. The program code is then copied into this new mapping and it is marked as read-only and executable (so that no more writing can be done there). We end up with a function remapToHugeStep3but in a new area of ​​memory.

remapToHugeStep2
__attribute__((__noinline__)) void remapToHugeStep2(void * begin, size_t size, void * scratch)
{
    /** Unmap old memory region with the code of our program.
      * Our instruction pointer is located inside scratch area and this function can execute after old code is unmapped.
      * But it cannot call any other functions because they are not available at usual addresses
      * - that's why we have to use "our_syscall" function and a substitution for memcpy.
      * (Relative addressing may continue to work but we should not assume that).
      */

    int64_t offset = reinterpret_cast<intptr_t>(scratch) - reinterpret_cast<intptr_t>(begin);
    int64_t (*syscall_func)(...) = reinterpret_cast<int64_t (*)(...)>(reinterpret_cast<intptr_t>(our_syscall) + offset);

    int64_t munmap_res = syscall_func(SYS_munmap, begin, size);
    if (munmap_res != 0)
        return;

    /// Map new anonymous memory region in place of old region with code.

    int64_t mmap_res = syscall_func(SYS_mmap, begin, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (-1 == mmap_res)
        syscall_func(SYS_exit, 1);

    /// As the memory region is anonymous, we can do madvise with MADV_HUGEPAGE.

    syscall_func(SYS_madvise, begin, size, MADV_HUGEPAGE);

    /// Copy the code from scratch area to the old memory location.

    {
        __m128i * __restrict dst = reinterpret_cast<__m128i *>(begin);
        const __m128i * __restrict src = reinterpret_cast<const __m128i *>(scratch);
        const __m128i * __restrict src_end = reinterpret_cast<const __m128i *>(reinterpret_cast<const char *>(scratch) + size);
        while (src < src_end)
        {
            _mm_storeu_si128(dst, _mm_loadu_si128(src));

            ++dst;
            ++src;
        }
    }

    /// Make the memory area with the code executable and non-writable.

    syscall_func(SYS_mprotect, begin, size, PROT_READ | PROT_EXEC);

    /** Step 3 function should unmap the scratch area.
      * The currently executed code is located in the scratch area and cannot be removed here.
      * We have to call another function and use its address from the original location (not in scratch area).
      * To do it, we obtain its pointer and call by pointer.
      */

    void(* volatile step3)(void*, size_t, size_t) = remapToHugeStep3;
    step3(scratch, size, offset);
}

remapToHugeStep3 removes already unnecessary temporary memory with the program code and changes the return address from the function on the stack so as not to return to the temporary area, which no longer exists.

remapToHugeStep3
__attribute__((__noinline__)) void remapToHugeStep3(void * scratch, size_t size, size_t offset)
{
    /// The function should not use the stack, otherwise various optimizations, including "omit-frame-pointer" may break the code.

    /// Unmap the scratch area.
    our_syscall(SYS_munmap, scratch, size);

    /** The return address of this function is pointing to scratch area (because it was called from there).
      * But the scratch area no longer exists. We should correct the return address by subtracting the offset.
      */
    __asm__ __volatile__("subq %0, 8(%%rsp)" : : "r"(offset) : "memory");
}

That’s all, in general.

conclusions

  1. Everything works, Huge Pages are used

  2. The number of iTLB misses has decreased to almost zero

  3. The speed hasn’t changed at all.

  4. How this affected the number of unstable tests – the report does not say. If you find this information, please let me know.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *