Periodically, my subscribers ask me how to do microarchitectural analysis on Windows. To be honest, it has never been a particular problem for me. Guess why? Because I work for Intel, and of course I have a license for Intel® VTune™ Amplifier. So I cannot fully feel the pain of people who do performance-related work on Windows and have no access to VTune or AMD CodeAnalyst. Since this was not a problem for me, I did nothing to solve it. Then I recently read Bartek’s coding blog and came across the article “A curious case of branching performance”. It looked like a case that could easily be checked by running perf stat if we were on Linux. But since we’re on Windows, it’s not that simple.
In this article, I want to present one way to collect PMU counters without Intel® VTune™ Amplifier. I took almost all the necessary information from Bruce Dawson’s blog. He wrote an article that I want to expand on and make more step-by-step. All the credit here belongs to Bruce, because I am not the author of this work. If you’d like to experiment yourself, I suggest you first reproduce the example described in Bruce’s article (here’s a link to GitHub with sources and scripts).
However, do not take everything written in this article at face value. I’m not a Windows developer and I don’t spend much time analyzing Windows performance. This is just one way to collect PMU counters; there may be others that are simpler and more reliable. After all, you can always purchase Intel® VTune™ Amplifier, which, by the way, can be quite expensive. But I want to say right away that if you are going to do serious performance analysis and tuning on Windows, there are no real alternatives to VTune (and this is not an advertisement).
What tools do you need?
xperf. You need to install Windows Performance Toolkit, which is part of Windows Assessment and Deployment Kit (Windows ADK). xperf was added to my PATH automatically.
tracelog. Follow these instructions to get this tool. You need to install the following components:
Windows Driver Kit
tracelog was not added to my PATH, but I found it at “C:\Program Files (x86)\Windows Kits\10\bin\10.0.17763.0\x64”. This path may differ for you.
All of these kits will take some time to install, so please be patient.
Using tracelog and xperf to collect traces
I will use the example provided by Bruce, partially repeating his steps. This is how you can get traces (with information about branch mispredictions) from your application using the tools mentioned above (the commands must be run as administrator):
tracelog.exe -start counters -f counters.etl -eflag CSWITCH+PROC_THREAD+LOADER -PMC BranchMispredictions,BranchInstructions:CSWITCH
<your app>
xperf -stop counters
xperf -merge counters.etl pmc_counters_merged.etl
xperf -i pmc_counters_merged.etl -o pmc_counters.txt
If we take a look at pmc_counters.txt, we will see all the traces in text format. There’s a lot to learn from them, but let’s focus on two kinds of events:
Pmc events (performance monitoring counter):
Pmc, TimeStamp, ThreadID, BranchMispredictions, BranchInstructions
CSwitch events (context switch):
CSwitch, TimeStamp, New Process Name ( PID), New TID, NPri, NQnt, TmSinceLast, WaitTime, Old Process Name ( PID), Old TID, OPri, OQnt, OldState, Wait Reason, Swapable, InSwitchTime, CPU, IdealProc, OldRemQnt, NewPriDecr, PrevCState, OldThrdBamQosLevel, NewThrdBamQosLevel
Consider this fragment of the trace:
Pmc, 214810, 5956, 1101534, 44324578
CSwitch, 214810, ConditionalCount.exe (14224), 5956, 9, -1, 6, 0, System ( 4), 560, 12, -1, Waiting, WrQueue, NonSwap, 6, 1, 3, 84017152, 0, 0, Important, Important
Pmc, 214821, 14460, 1101713, 44326484
CSwitch, 214821, csrss.exe ( 888), 14460, 14, -1, 73556, 5, ConditionalCount.exe (14224), 5956, 9, -1, Waiting, WrLpcReply, Swapable, 11, 1, 3, 77701120, 0, 0, Important, Important
Note that each CSwitch event has a corresponding Pmc event with the same timestamp. In this fragment of the trace, a context switch occurred from our process (ConditionalCount.exe) to another process (csrss.exe). We can see this from the Old Process Name (PID) field of the CSwitch event with timestamp 214821. So there was a period of time, between timestamps 214810 and 214821, during which ConditionalCount.exe was running on the CPU.
The BranchMispredictions counter only ever increases. We can calculate how many branches were mispredicted during this period by taking the difference between its values in the two Pmc events. For this fragment, there were 1101713 - 1101534 = 179 branch mispredictions. By summing all such deltas, we can calculate the total number of branch mispredictions over the entire lifetime of the application.
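The delta calculation can be sketched in a few lines of Python. This is only a minimal illustration, not Bruce's script: the function name and the field positions (taken from the CSwitch field list shown above) are my assumptions, and I also assume that each Pmc line is immediately followed by its CSwitch line and that counters should be tracked separately per CPU.

```python
from collections import defaultdict

# Field positions, counted from the CSwitch field list shown above.
OLD_PROC_FIELD = 8   # "Old Process Name ( PID)"
CPU_FIELD = 16       # "CPU"


def sum_pmc_deltas(lines):
    """Attribute counter deltas between context switches to the process
    that was switched out. Returns {process: [mispredictions, branches]}."""
    totals = defaultdict(lambda: [0, 0])
    last_on_cpu = {}   # cpu -> counter values seen at the previous switch
    pending = None     # (timestamp, counters) from the latest Pmc line

    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        # Skip blank lines and header lines (their second field is not a number).
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if fields[0] == "Pmc":
            pending = (int(fields[1]), [int(v) for v in fields[3:5]])
        elif fields[0] == "CSwitch" and pending is not None:
            timestamp, counters = pending
            pending = None
            if int(fields[1]) != timestamp:
                continue  # the Pmc event does not belong to this switch
            cpu = int(fields[CPU_FIELD])
            if cpu in last_on_cpu:
                previous = last_on_cpu[cpu]
                old_process = fields[OLD_PROC_FIELD]
                totals[old_process][0] += counters[0] - previous[0]
                totals[old_process][1] += counters[1] - previous[1]
            last_on_cpu[cpu] = counters
    return dict(totals)
```

Fed the four-line fragment above, this attributes 179 mispredictions to ConditionalCount.exe (14224), matching the manual calculation.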
Pro tip: If you see performance that differs from what you expected, I still recommend running the same benchmark on Linux with perf stat. You can find many articles on how to do this on my blog. Another option is to dump the assembly and check that it contains the code you expect. Perhaps the compiler did something clever and removed the code you wanted to test.
Analyzing traces using a Python script
To analyze the traces and extract information, Bruce wrote a special script. It retrieves the PMC values for the processes we are interested in (it takes 2 arguments):
python.exe etwpmc_parser.py pmc_counters.txt <your app>
Here is the result I got on my computer (Windows 10, Intel(R) Core(TM) i5-7300U):
Process name: branch misp rate, [br_misp, total branc]
ConditionalCount.exe (14224): 21.91%, [109184040, 498250335], 3690 context switches, time: 1093072
ConditionalCount.exe (10964): 0.07%, [369677, 496453009], 761 context switches, time: 257492
VTune shows similar results.
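The misprediction rate in that output is simply the mispredicted branches divided by the total branches. A tiny sketch to check the numbers yourself (the function name is mine, not something from Bruce's script):

```python
def mispredict_rate(mispredictions, branches):
    """Return mispredicted branches as a percentage of all branches."""
    return 100.0 * mispredictions / branches


# Values copied from the script output above.
runs = [
    ("ConditionalCount.exe (14224)", 109184040, 498250335),
    ("ConditionalCount.exe (10964)", 369677, 496453009),
]
for name, mispredictions, branches in runs:
    print(f"{name}: {mispredict_rate(mispredictions, branches):.2f}%")
```

This reproduces the 21.91% and 0.07% rates reported for the two processes.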
What other counters can we collect?
> tracelog.exe -profilesources Help
Id  Name                      Interval  Min   Max
--------------------------------------------------------------
 0  Timer                     10000     1221  1000000
 2  TotalIssues               65536     4096  2147483647
 6  BranchInstructions        65536     4096  2147483647
10  CacheMisses               65536     4096  2147483647
11  BranchMispredictions      65536     4096  2147483647
19  TotalCycles               65536     4096  2147483647
25  UnhaltedCoreCycles        65536     4096  2147483647
26  InstructionRetired        65536     4096  2147483647
27  UnhaltedReferenceCycles   65536     4096  2147483647
28  LLCReference              65536     4096  2147483647
29  LLCMisses                 65536     4096  2147483647
30  BranchInstructionRetired  65536     4096  2147483647
31  BranchMispredictsRetired  65536     4096  2147483647
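Since the list of available counters depends on the machine, it can be handy to check the names before starting a session. Here is a rough sketch that parses the text printed by the command above; the function names are hypothetical, and I assume each data row has exactly five whitespace-separated columns and that counter names contain no spaces.

```python
def parse_profile_sources(text):
    """Parse the 'tracelog.exe -profilesources Help' table into a dict
    keyed by counter name."""
    sources = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: Id Name Interval Min Max (Id is numeric).
        if len(parts) == 5 and parts[0].isdigit():
            ident, name, interval, minimum, maximum = parts
            sources[name] = {
                "id": int(ident),
                "interval": int(interval),
                "min": int(minimum),
                "max": int(maximum),
            }
    return sources


def missing_counters(requested, sources):
    """Return the requested counter names that the table does not list."""
    return [name for name in requested if name not in sources]
```

For example, save the command output to a file, pass its contents to parse_profile_sources, and call missing_counters with the names you plan to give to -PMC.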
This approach has limitations. First, it falls far short of what VTune or perf on Linux can do. The number of counters is limited, and this is counting only, without sampling (see the difference between counting and sampling here). Still, at least you can do some preliminary performance analysis.
Second, if you want to collect PMCs other than branch mispredictions, you need to change not only the tracelog command but also the Python script that analyzes the traces.
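One way to keep the command and the parsing in sync is to drive both from a single list of counter names. The sketch below is an illustration under assumptions, not how Bruce's script works: the function names are hypothetical, the tracelog flags are just the ones used earlier in this article, and I assume the counter columns in a Pmc event appear in the same order as the names passed to -PMC.

```python
def tracelog_start_args(counters, etl_file="counters.etl", session="counters"):
    """Build the argument list for starting a tracelog session that records
    the given counters on every context switch."""
    return [
        "tracelog.exe", "-start", session, "-f", etl_file,
        "-eflag", "CSWITCH+PROC_THREAD+LOADER",
        "-PMC", ",".join(counters) + ":CSWITCH",
    ]


def parse_pmc_line(line, counters):
    """Map the counter columns of a Pmc line to the given counter names.
    Assumes the columns after ThreadID follow the order used in -PMC."""
    fields = [f.strip() for f in line.split(",")]
    values = [int(v) for v in fields[3:3 + len(counters)]]
    return dict(zip(counters, values))


# Counter names taken from the -profilesources list above.
COUNTERS = ["LLCMisses", "LLCReference"]
print(" ".join(tracelog_start_args(COUNTERS)))
```

The same COUNTERS list is then passed to parse_pmc_line when reading the text dump, so switching to other counters only requires editing one place.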
If you know of any other way to make this easier or better, let me know. I would definitely like to hear about it.
I hope this also helps people who use Windows and want to take part in my competition. If so, sign up using the form at the bottom of the page.