How do I collect CPU performance counters in Windows?

From time to time, my subscribers ask me how to do microarchitectural analysis on Windows. To be honest, it has never been a particular problem for me. Guess why? Because I work for Intel, and of course I have a license for Intel® VTune™ Amplifier. So I cannot fully feel the pain of people who do performance-related work on Windows without access to VTune or AMD CodeAnalyst. Since this was not a problem for me, I did nothing to solve it. Then I was recently reading Bartek’s coding blog and came across the article “A curious case of branching performance”. It looked like a case that could be verified easily by running perf stat, if only we were on Linux. But since we’re on Windows, it’s not that simple.

In this article, I want to present one way to collect PMU counters without Intel® VTune™ Amplifier. I took almost all the necessary information from Bruce Dawson’s blog. He wrote an article that I want to expand on and make more step-by-step. All the credit here belongs to Bruce; I am not the author of this work. If you’d like to experiment yourself, I suggest you first reproduce the example described in Bruce’s article (here’s a link to GitHub with sources and scripts).

However, do not take everything in this article at face value. I’m not a Windows developer and I don’t spend much time analyzing Windows performance. This is just one way to collect PMU counters; there may be others that are simpler and more reliable. After all, you can always purchase Intel® VTune™ Amplifier, which, by the way, can be quite expensive. But I want to say right away that if you are going to do serious performance analysis and tuning on Windows, there are no real alternatives to VTune (and this is not an advertisement).

What tools do you need?

  1. xperf. You need to install the Windows Performance Toolkit, which is part of the Windows Assessment and Deployment Kit (Windows ADK). In my case, xperf was added to PATH automatically.

  2. tracelog. Follow these instructions to get this tool. You need to install the following components:

    • Windows Driver Kit

    • Visual Studio

    • Windows SDK

Tracelog was not added to my PATH, but I found it at “C:\Program Files (x86)\Windows Kits\10\bin\10.0.17763.0\x64”. This path may differ on your machine.

All of these kits will take some time to install, so please be patient.

Using tracelog and xperf to collect traces

I will use the example provided by Bruce, partially repeating his steps. This is how you can get traces (with information about incorrect branch predictions) from your application using the tools mentioned above (must be run as administrator):

tracelog.exe -start counters -f counters.etl -eflag CSWITCH+PROC_THREAD+LOADER -PMC BranchMispredictions,BranchInstructions:CSWITCH
<your app>
xperf -stop counters
xperf -merge counters.etl pmc_counters_merged.etl
xperf -i pmc_counters_merged.etl -o pmc_counters.txt
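
If you repeat this sequence often, it may be convenient to script it. Below is a minimal Python sketch that assembles the same five steps. The function names build_commands and collect are my own (hypothetical), and the sketch assumes that tracelog.exe and xperf are on PATH and that the script is run as administrator. The flags are exactly the ones shown above, nothing more.

```python
import subprocess

def build_commands(app_cmd,
                   counters=("BranchMispredictions", "BranchInstructions"),
                   session="counters"):
    """Return the five steps from the article as argument lists."""
    etl = f"{session}.etl"
    return [
        # Start a trace session that records the counters on every context switch.
        ["tracelog.exe", "-start", session, "-f", etl,
         "-eflag", "CSWITCH+PROC_THREAD+LOADER",
         "-PMC", ",".join(counters) + ":CSWITCH"],
        list(app_cmd),                               # the application under test
        ["xperf", "-stop", session],                 # stop the session
        ["xperf", "-merge", etl, "pmc_counters_merged.etl"],
        ["xperf", "-i", "pmc_counters_merged.etl", "-o", "pmc_counters.txt"],
    ]

def collect(app_cmd):
    """Run all steps; the session is stopped even if the application fails."""
    start, app, stop, merge, dump = build_commands(app_cmd)
    subprocess.run(start, check=True)
    try:
        subprocess.run(app)                # the app's exit code is not checked
    finally:
        subprocess.run(stop, check=True)   # always stop the trace session
    subprocess.run(merge, check=True)
    subprocess.run(dump, check=True)
```

Calling collect(["ConditionalCount.exe"]) would then leave pmc_counters.txt in the current directory, the same as running the commands by hand.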

If we look at pmc_counters.txt, we will see all the trace events in text format. There’s a lot to learn from them, but let’s focus on two things:

  1. Pmc events (performance monitoring counter):

Pmc,  TimeStamp,   ThreadID, BranchMispredictions, BranchInstructions

  2. CSwitch events (context switch):

CSwitch,  TimeStamp, New Process Name ( PID),    New TID, NPri, NQnt, TmSinceLast, 
WaitTime, Old Process Name ( PID),    Old TID, OPri, OQnt,        OldState,      
Wait Reason, Swapable, InSwitchTime, CPU, IdealProc,  OldRemQnt, NewPriDecr, 
PrevCState, OldThrdBamQosLevel, NewThrdBamQosLevel

Consider this fragment of the trace:

Pmc,     214810,       5956, 1101534, 44324578
                CSwitch,     214810, ConditionalCount.exe (14224),       5956,    9,   -1,           6,        0,           System (   4),        560,   12,   -1,         Waiting,          WrQueue,  NonSwap,      6,   1,   3,   84017152,    0,    0,   Important,   Important
                    Pmc,     214821,      14460, 1101713, 44326484
                CSwitch,     214821,        csrss.exe ( 888),      14460,   14,   -1,       73556,        5, ConditionalCount.exe (14224),       5956,    9,   -1,         Waiting,       WrLpcReply, Swapable,     11,   1,   3,   77701120,    0,    0,   Important,   Important

Note that there is a corresponding Pmc event for each CSwitch event; they have the same timestamps. In this fragment of the trace, a context switch occurred from our process (ConditionalCount.exe) to another process (csrss.exe). We can see this by looking at the Old Process Name (PID) field of the CSwitch event with timestamp 214821. So there was a period of time during which ConditionalCount.exe was running on the CPU (between timestamps 214810 and 214821).

The BranchMispredictions counter increases monotonically. We can calculate how many branches were mispredicted during this time period by taking the difference between its values in the two Pmc events. For this fragment, there were 1101713 – 1101534 = 179 branch mispredictions. By summing all such deltas, we can calculate the total number of branch mispredictions over the entire lifetime of the application.
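
To make the arithmetic concrete, here is a small Python sketch that applies it to the fragment above. It is not Bruce’s script, just an illustration under two assumptions of mine: each Pmc line belongs to the CSwitch line that follows it, and deltas are taken between consecutive switches on the same CPU (the CPU column of the CSwitch event) and attributed to the process being switched out. The function name counter_deltas is hypothetical.

```python
FRAGMENT = """\
Pmc,     214810,       5956, 1101534, 44324578
CSwitch,     214810, ConditionalCount.exe (14224),       5956,    9,   -1,           6,        0,           System (   4),        560,   12,   -1,         Waiting,          WrQueue,  NonSwap,      6,   1,   3,   84017152,    0,    0,   Important,   Important
Pmc,     214821,      14460, 1101713, 44326484
CSwitch,     214821,        csrss.exe ( 888),      14460,   14,   -1,       73556,        5, ConditionalCount.exe (14224),       5956,    9,   -1,         Waiting,       WrLpcReply, Swapable,     11,   1,   3,   77701120,    0,    0,   Important,   Important
"""

OLD_PROCESS = 8   # index of the "Old Process Name ( PID)" field in a CSwitch line
CPU = 16          # index of the "CPU" field in a CSwitch line

def counter_deltas(trace_text):
    """Sum counter deltas per process that was switched out."""
    last_by_cpu = {}   # CPU -> counter values seen at the previous switch
    totals = {}        # process -> accumulated deltas, one entry per counter
    pending = None     # counters from the most recent Pmc line
    for line in trace_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if fields[0] == "Pmc":
            pending = [int(v) for v in fields[3:]]
        elif fields[0] == "CSwitch" and pending is not None:
            cpu, old = fields[CPU], fields[OLD_PROCESS]
            if cpu in last_by_cpu:
                delta = [n - p for n, p in zip(pending, last_by_cpu[cpu])]
                prev = totals.get(old, [0] * len(delta))
                totals[old] = [a + d for a, d in zip(prev, delta)]
            last_by_cpu[cpu] = pending
            pending = None
    return totals

print(counter_deltas(FRAGMENT))
# {'ConditionalCount.exe (14224)': [179, 1906]}
```

The first value, 179, matches the difference computed by hand; the second is the number of branch instructions executed in the same interval.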

Pro tip: If you see performance that differs from what you expected, I still recommend running the same benchmark on Linux with perf stat. You can find many articles on how to do this on my blog. Another option is to dump the assembly and check that it contains the code you expect. Perhaps the compiler did something clever and removed the code you wanted to test.

Analyzing traces using a Python script

To analyze traces and extract information, Bruce wrote a dedicated script. It extracts the PMC values for the processes we are interested in and takes two arguments:

python.exe etwpmc_parser.py pmc_counters.txt <your app>

Here is the result I got on my computer (Windows 10, Intel(R) Core(TM) i5-7300U):

 Process name:  branch misp rate, [br_misp, total branc]
  ConditionalCount.exe (14224):            21.91%, [109184040, 498250335], 3690 context switches, time: 1093072
  ConditionalCount.exe (10964):             0.07%, [369677, 496453009],    761 context switches,  time: 257492

VTune shows similar results.
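
The percentage in the first column is simply the number of mispredicted branches divided by the total number of branches. As a quick sanity check, the short Python sketch below recomputes it from the two bracketed values; the helper name misprediction_rate is my own.

```python
def misprediction_rate(mispredicted, total):
    """Branch misprediction rate in percent."""
    return 100.0 * mispredicted / total

print(f"{misprediction_rate(109184040, 498250335):.2f}%")  # 21.91%
print(f"{misprediction_rate(369677, 496453009):.2f}%")     # 0.07%
```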

What other counters can we collect?

> tracelog.exe -profilesources Help
Id  Name                        Interval  Min      Max
  0 Timer                          10000  1221    1000000
  2 TotalIssues                    65536  4096 2147483647
  6 BranchInstructions             65536  4096 2147483647
 10 CacheMisses                    65536  4096 2147483647
 11 BranchMispredictions           65536  4096 2147483647
 19 TotalCycles                    65536  4096 2147483647
 25 UnhaltedCoreCycles             65536  4096 2147483647
 26 InstructionRetired             65536  4096 2147483647
 27 UnhaltedReferenceCycles        65536  4096 2147483647
 28 LLCReference                   65536  4096 2147483647
 29 LLCMisses                      65536  4096 2147483647
 30 BranchInstructionRetired       65536  4096 2147483647
 31 BranchMispredictsRetired       65536  4096 2147483647


This method falls far short of what VTune can do, or perf on Linux. The number of counters is limited, and this is counting only, without sampling (see the difference between counting and sampling here). Still, it at least lets you do some preliminary performance analysis.

Also, if you want to collect PMCs other than branch mispredictions, you need to change not only the tracelog command but also the Python script that analyzes the traces.
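
One way to reduce that coupling is to read the counter names from the trace itself instead of hard-coding them. The Python sketch below assumes that the text dump contains the Pmc header line in the form shown earlier (Pmc, TimeStamp, ThreadID, followed by the counter names). Both helper names are hypothetical, and this is not how Bruce’s script works.

```python
def counter_names(trace_text):
    """Return the counter names from the Pmc header line, or [] if absent."""
    for line in trace_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if fields[:3] == ["Pmc", "TimeStamp", "ThreadID"]:
            return fields[3:]
    return []

def ratio_percent(totals, names, numerator, denominator):
    """Ratio of two accumulated counters in percent; totals is ordered like names."""
    by_name = dict(zip(names, totals))
    return 100.0 * by_name[numerator] / by_name[denominator]

HEADER = "Pmc,  TimeStamp,   ThreadID, BranchMispredictions, BranchInstructions"
names = counter_names(HEADER)
print(names)  # ['BranchMispredictions', 'BranchInstructions']
```

With names taken from the header, the same ratio_percent call works for any pair of counters, for example ratio_percent(totals, names, "LLCMisses", "LLCReference") for a trace collected with those two counters.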

If you know of any other way to make this easier or better, let me know. I would definitely like to hear about it.

I hope this also helps people who use Windows and want to take part in my competition. If so, sign up using the form at the bottom of the page.
