Debug builds that run two to three times faster

We have achieved significant runtime performance improvements in the x86/x64 C++ compiler for Visual Studio's default debug configuration. For programs compiled in debug mode with Visual Studio 2019 version 16.10 Preview 2, we measured a 2x to 3x speedup. These improvements come from reducing the overhead of the runtime checks (/RTC), which are enabled by default.

Default debug configuration

When you compile your code in Visual Studio with the debug configuration, some flags are passed to the C++ compiler by default. The ones most relevant to this article are /RTC1, /JMC, and /ZI.

While all of these flags add useful debugging functionality, their interaction, especially when /RTC1 is present, results in significant overhead. In this release, we removed the unnecessary overhead without compromising the quality of error detection or the ease of debugging.

Consider the following simple function:

int foo() {
    return 32;
}

and the x64 assembly generated by the 16.9 compiler when compiled with the /RTC1 /JMC /ZI flags:

int foo(void) PROC
$LN3:
        push rbp
        push rdi
        sub rsp, 232                ; extra space allocated because of /ZI, /JMC
        lea rbp, QWORD PTR [rsp+32]
        mov rdi, rsp
        mov ecx, 58                 ; (= x)
        mov eax, -858993460         ; 0xCCCCCCCC
        rep stosd                   ; write 0xCC to the stack for x DWORDs
        lea rcx, OFFSET FLAT:__977E49D0_example@cpp
        ; call inserted because of /JMC
        call __CheckForDebuggerJustMyCode
        mov eax, 32
        lea rsp, QWORD PTR [rbp+200]
        pop rdi
        pop rbp
        ret 0
int foo(void) ENDP

In the assembly shown above, the /JMC and /ZI flags add a total of 232 extra bytes to the stack (line 5). This stack space is not always necessary. Combined with the /RTC1 flag, which initializes the allocated stack space (line 10), this consumes many CPU cycles. In this particular example, the allocated stack space is necessary for /JMC and /ZI to function properly, but initializing it is not. We can prove at compile time that these checks are unnecessary. Such functions abound in any real C++ codebase, hence the performance gain.

Next, we'll dive deeper into each of these flags, how they interact with /RTC1, and how we avoid the unnecessary overhead.
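The constants in the listing can be cross-checked with a little arithmetic. The sketch below is only a software model of the fill loop, not the compiler's code; the names `fill_iterations`, `as_signed32`, and `fill_frame` are made up for illustration, while the numbers 232, 58, and 0xCCCCCCCC come from the assembly above.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the assembly listing.
constexpr std::uint32_t kFrameBytes    = 232;          // sub rsp, 232
constexpr std::uint32_t kFillPattern   = 0xCCCCCCCCu;  // value loaded into eax
constexpr std::uint32_t kBytesPerDword = 4;            // rep stosd stores one DWORD per iteration

// Number of DWORD stores needed to cover the frame (the value placed in ecx).
constexpr std::uint32_t fill_iterations(std::uint32_t frame_bytes) {
    return frame_bytes / kBytesPerDword;
}

// The pattern reinterpreted as a signed 32-bit value, as the listing prints it.
constexpr std::int64_t as_signed32(std::uint32_t v) {
    return static_cast<std::int64_t>(v) - (v >= 0x80000000u ? 0x100000000LL : 0LL);
}

static_assert(fill_iterations(kFrameBytes) == 58, "matches mov ecx, 58");
static_assert(as_signed32(kFillPattern) == -858993460LL, "matches mov eax, -858993460");

// Software model of the fill: the work grows linearly with the frame size,
// which is why a larger frame makes every call more expensive.
inline void fill_frame(std::uint32_t* frame, std::uint32_t frame_bytes) {
    assert(frame_bytes % kBytesPerDword == 0);  // this model handles whole DWORDs only
    for (std::uint32_t i = 0; i < fill_iterations(frame_bytes); ++i) {
        frame[i] = kFillPattern;
    }
}
```

For a function as small as `foo`, these 58 stores outweigh the single `mov eax, 32` that does the actual work.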

/RTC1

Using the /RTC1 flag is equivalent to using both the /RTCs and /RTCu flags. /RTCs initializes the function's stack frame with 0xCC to perform various checks at runtime, such as detecting uninitialized local variables, detecting array overruns and underruns, and verifying the stack pointer (on x86). These checks also bloat the generated code.

As you can see from the assembly above (line 10), the rep stosd instruction introduced by /RTCs is the main reason for the slowdown. The situation gets worse when /RTC (or /RTC1) is used together with /JMC, /ZI, or both.
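To get a feel for how a fill pattern helps catch array overruns and underruns, here is a minimal simulation. It is an assumption-laden model, not the code MSVC emits: `ModelFrame`, the guard-zone size `kGuard`, and the array size `kArray` are all hypothetical. The idea is simply that bytes next to an array are pre-filled with 0xCC and checked afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr unsigned char kFill  = 0xCC;  // fill byte mentioned in the text
constexpr std::size_t   kGuard = 4;     // hypothetical guard-zone size
constexpr std::size_t   kArray = 8;     // size of the modeled local array

// Models a stack frame: [guard][array][guard], all pre-filled with 0xCC.
struct ModelFrame {
    unsigned char bytes[kGuard + kArray + kGuard];

    ModelFrame() { std::memset(bytes, kFill, sizeof bytes); }

    // Writes relative to the start of the modeled array. Indices in
    // [-kGuard, kArray + kGuard) stay inside the model, so a "bad" index
    // is simulated safely instead of corrupting real memory.
    void write(std::ptrdiff_t index, unsigned char value) {
        const std::ptrdiff_t pos = static_cast<std::ptrdiff_t>(kGuard) + index;
        assert(pos >= 0 && pos < static_cast<std::ptrdiff_t>(sizeof bytes));
        bytes[pos] = value;
    }

    // True if both guard zones still hold the fill pattern.
    bool guards_intact() const {
        for (std::size_t i = 0; i < kGuard; ++i) {
            if (bytes[i] != kFill) return false;                    // underrun side
            if (bytes[kGuard + kArray + i] != kFill) return false;  // overrun side
        }
        return true;
    }
};
```

In-bounds writes leave `guards_intact()` true; a write at index `kArray` or `-1` flips it to false. A check like this only works if the surrounding memory was initialized first, which is where the cost comes from.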

Interaction with /JMC

/JMC stands for Just My Code debugging. During debugging, it automatically steps over functions that you did not write (framework, library, and other non-user code). It works by inserting a call to a runtime library function into the prologue of every function, which helps the debugger distinguish user code from non-user code. The problem is that inserting a call into the prologue of every function means your project no longer has any leaf functions. A function that originally needed no stack frame now needs one, because the AMD64 ABI for Windows requires at least four stack slots to be available for function parameters (the so-called parameter home area). As a result, every function that /RTC previously left uninitialized, because it was a leaf function with no stack frame, is now initialized. Programs commonly contain many leaf functions, especially when they use a heavily templated library such as the STL, so in this case /JMC will happily eat up some of your CPU cycles. This does not apply to x86 (32-bit), because there is no parameter home area there.
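The effect on leaf functions can be pictured at the source level. The sketch below is a model under stated assumptions: `check_for_debugger_model`, `g_file_flag`, and `g_hook_calls` are hypothetical stand-ins and do not reproduce what `__CheckForDebuggerJustMyCode` actually does. The 8-byte slot size assumes one 64-bit register per home slot.

```cpp
#include <cassert>
#include <cstddef>

// Parameter home area: four slots (from the text), 8 bytes each on x64.
constexpr std::size_t kHomeSlots     = 4;
constexpr std::size_t kSlotBytes     = 8;
constexpr std::size_t kHomeAreaBytes = kHomeSlots * kSlotBytes;
static_assert(kHomeAreaBytes == 32, "four 8-byte slots");

// Hypothetical stand-ins for the per-file flag and the runtime helper.
static unsigned char g_file_flag  = 1;
static int           g_hook_calls = 0;

static void check_for_debugger_model(unsigned char* flag) {
    if (*flag != 0) ++g_hook_calls;  // placeholder behavior, for counting only
}

// A true leaf function: calls nothing, so it needs no home area.
static int foo_leaf() { return 32; }

// The same function with a prologue call, as /JMC would add. It is no
// longer a leaf, so it must reserve a home area for its callee.
static int foo_instrumented() {
    check_for_debugger_model(&g_file_flag);
    return 32;
}
```

Both functions return 32, but only `foo_instrumented` makes a call, so only it needs a frame. The 32-byte figure is consistent with the `lea rbp, QWORD PTR [rsp+32]` in the listing for `foo`.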

Interaction with /ZI

The next interaction we'll discuss is with /ZI. This flag enables the Edit and Continue feature, which means you don't have to recompile the entire program during debugging for small changes.

To support this feature, we add some padding bytes to the stack (the actual number of padding bytes depends on the size of the function). This way, any new variables that you add during a debugging session can be placed in the padding area without changing the overall size of the stack frame, and you can continue debugging without having to recompile your code. Enabling this flag can, for example, add an extra 64 bytes to the generated code.

As you might have guessed, the larger the stack area gets, the more /RTC has to initialize, which leads to increased overhead.

Solution

The root of all these problems is unnecessary initialization. Do we really need to initialize the stack area every time? No. The compiler can safely determine when stack initialization is really necessary. For example, it is needed when the function has at least one address-taken variable, a declared array, or an uninitialized variable. In every other case, we can safely skip the initialization, because the runtime checks would not find anything useful anyway.
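The decision described above boils down to a simple predicate. The following is a minimal sketch, assuming the compiler has already collected three facts about a function; `FunctionSummary` and `needs_stack_fill` are hypothetical names, and the real analysis inside the compiler is certainly more involved.

```cpp
#include <cassert>

// Hypothetical per-function facts gathered at compile time.
struct FunctionSummary {
    bool has_address_taken_local;  // some local has its address taken
    bool has_local_array;          // some local is an array
    bool has_uninitialized_local;  // some local is declared without an initializer
};

// The 0xCC fill is only worth doing if a runtime check could fire.
constexpr bool needs_stack_fill(const FunctionSummary& f) {
    return f.has_address_taken_local
        || f.has_local_array
        || f.has_uninitialized_local;
}

// int foo() { return 32; } has no locals at all, so the fill can be skipped.
static_assert(!needs_stack_fill(FunctionSummary{false, false, false}),
              "trivial function: no fill");

// void bar() { char buf[16]; /* ... */ } declares an array, so the fill stays.
static_assert(needs_stack_fill(FunctionSummary{false, true, false}),
              "array local: fill required");
```

Because the predicate depends only on facts known at compile time, skipping the fill costs nothing at runtime.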

The situation gets a little more complicated when you compile with Edit and Continue, because you can now add uninitialized variables during debugging, and these can only be detected if the stack area has been initialized, which it most likely has not. To solve this problem, we included the necessary bits in the debugging information and exposed them through the Debug Interface Access SDK. This information tells the debugger where the padding area introduced by /ZI begins and ends. It also tells the debugger whether the function needs stack initialization. If so, the debugger unconditionally initializes the stack area in that memory range for the functions that you edited during your debugging session. New variables are always placed on top of this initialized area, so our runtime checks can now determine whether your newly added code is safe.
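The debugger's side of this arrangement might be pictured as follows. This is purely illustrative: `PaddingInfo` is a hypothetical record shape, not a Debug Interface Access SDK type, and `fill_padding_if_needed` models the behavior described above rather than any real debugger code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape of the information carried in the debug info.
struct PaddingInfo {
    std::size_t begin;             // offset of the first /ZI padding byte in the frame
    std::size_t end;               // offset one past the last padding byte
    bool        needs_stack_init;  // whether the function needs stack initialization
};

// Models the debugger filling the padding range of a function that was
// edited during the session. Returns true if the fill was performed.
inline bool fill_padding_if_needed(std::vector<unsigned char>& frame,
                                   const PaddingInfo& info,
                                   bool edited_this_session) {
    if (!edited_this_session || !info.needs_stack_init) return false;
    assert(info.begin <= info.end && info.end <= frame.size());
    std::fill(frame.begin() + static_cast<std::ptrdiff_t>(info.begin),
              frame.begin() + static_cast<std::ptrdiff_t>(info.end),
              static_cast<unsigned char>(0xCC));
    return true;
}
```

The point of the design is that the fill cost is paid once, by the debugger, and only for edited functions, instead of on every call of every function.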

Results

We compiled the following projects in the default debug configuration and then used the generated executables to run tests. We saw a 2x to 3x improvement in all the projects we tested. Projects that make heavy use of the STL may see even larger improvements. Let us know in the comments about any improvements you notice in your projects. Project 1 and Project 2 were provided by users.

Tell us what you think!

We hope this speedup makes your debugging workflow efficient and enjoyable. We are constantly listening to your feedback and working to improve your development workflow. We'd love to hear about your experience in the comments below. You can also reach us through Developer Community, by email (visualcpp@microsoft.com), and on Twitter (@VisualC).

