A culture-less tale about dotTrace
A short sketch about why it is important to write every line of code, every character, consciously – and, at the same time, a small master class on using dotTrace and the string class.
An artistic digression, a joke on a topical subject. To find out the most interesting things, reading the text under the cut is entirely optional.
It was late evening, right in the office, when no one else is around and no one will interfere. The gaze slowly moved along the entire body. Sometimes it stopped, just for a moment, at attractive places. And sometimes it froze for a long, indecently long time, intently examining the most interesting parts.
A concentrated gaze, a bitten lip, the whole body frozen, only the fingers moving slightly. It is obvious that this process is enjoyable and completely captivating.
– Ugh, what a lack of culture! – suddenly rang out across the open space.
The face twisted disapprovingly; disdain and dissatisfaction appeared in the gaze.
– Well, who has once again written a string comparison with InvariantCulture where it is not needed! And right in the middle of the body of this important method!
Late evening. There is no one left in the office. What to do? Take a trace of some application and analyze it, of course. Let it be Snitch.Enricher – why does it consume as many as 80 cores on peak-load days? This time it will be a dotTrace Timeline trace; I love timelines when analyzing CPU.
Go to the console and enter the command dottrace attach <pid> --profiling-type=Timeline.
Time to solve the puzzle. We have as much as 30 seconds of timeline from one Snitch.Enricher replica and a desire to reduce its CPU consumption.
We open the trace and see this picture:
On the left we see a menu where we can choose what to study. With nothing selected, we get the “time” spent by all threads in particular stacks (and functions). Time here also means CPU – the time in methods that are not waiting for something but doing useful work. That is exactly what we need now.
In the top center we see the “Hotspots” window: the methods that were most often at the top of the stacks of all threads during the whole time we recorded the trace. Great grouping – thanks, dotTrace, we'll use it.
Hotspots
Obviously there is a lot of garbage here: methods that wait and do nothing – everything with Wait, Sleep, and so on in the name. They only consumed time, not CPU, so they need to be removed. I highlighted them in yellow:
Yes, it is possible to run into a situation where these “waits” actually consumed a significant amount of resources, but such cases are extremely rare, and it is almost always worth starting the investigation by excluding them. Right-click and choose Exclude method.
Let's see what's left. There is still a lot of unpleasant stuff in the top hotspots:
Thread pool
First on the list is a thread pool method. Do we believe that the thread pool spent 59% of our CPU? More likely no than yes, because we have no other hints of thread pool problems: the application consistently uses 2-10 worker threads. Opening the code of PortableThreadPool.WorkerThreadStart(), literally its first line is semaphore.Wait().
Yes, SpinWaits are possible there, but it is logical to assume that most of the time the application ends up in a full-fledged wait at this point.
Since this is most likely just a wait, and right now it is getting in the way of the analysis, we can safely throw it out of the trace and reduce the complexity of the observed picture.
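To make it clearer why such a method can dominate “time” while consuming almost no CPU, here is a deliberately simplified sketch of the pattern – my own toy model, not the actual PortableThreadPool source:
using System;
using System.Collections.Concurrent;
using System.Threading;

// A toy worker loop: the thread spends almost all of its life blocked on a
// semaphore, so a profiler sees a lot of wall-clock time in Wait() but almost
// no CPU there. Only the dequeued work items burn CPU.
class TinyWorker
{
    private static readonly SemaphoreSlim semaphore = new SemaphoreSlim(0);
    private static readonly ConcurrentQueue<Action> queue = new ConcurrentQueue<Action>();

    static void Main()
    {
        var worker = new Thread(WorkerLoop) { IsBackground = true };
        worker.Start();

        queue.Enqueue(() => Console.WriteLine("useful work"));
        semaphore.Release();   // wake the worker

        Thread.Sleep(100);     // give the work item time to run before exiting
    }

    static void WorkerLoop()
    {
        while (true)
        {
            semaphore.Wait();  // wall time accumulates here, CPU does not
            if (queue.TryDequeue(out var work))
                work();        // this is the only real CPU work
        }
    }
}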
Sockets
The third method in the list is a socket method, a loop waiting for work. Opening the code of SocketAsyncEngine.EventLoop(), literally its first line is Interop.Sys.WaitForSocketEvents(). Do we believe that our application is talking very actively to someone over the network? More likely no than yes: we know that Snitch.Enricher is essentially a single-threaded number cruncher that reads one network stream and writes the result to another.
Since this is most likely just a wait, and it is getting in the way of the analysis, we can safely throw it out of the trace too. Reducing complexity.
We study the new top hotspots:
Native stack traces
Now this is where it gets interesting! There are some unrecognized stacks at the top: Native stack traces without resolved functions. Typically, this is time (which may be both CPU consumption and waiting) spent somewhere in system functions that dotTrace cannot parse. Well, to hell with them. Just below the first line, genuinely interesting things start to stand out.
Look at the second line. This is a method from the depths of .NET, not our code, which is why it is colored gray (not black). The CompareInfo class really does have such a method, and looking at it, you immediately believe that it consumes CPU.
Hotspots.Usages
Let's move on to the other capabilities of dotTrace and see who called this method!
Aha, all roads lead to the methods ProcessHttpSpan and ProcessCustomSpan of the SpanReaderWriter class – either directly, or transitively via Normalize and IsError. A good narrowing of the search range, but overall nothing surprising: these methods are the core work of the application. For each element (span) read from the stream, one of the ProcessXXXSpan methods is called. Let's look into the code!
If you opened the code of SpanReaderWriter.ProcessHttpSpan and read it very carefully, you would not find anything about hash codes of strings. Same for ProcessCustomSpan, Normalize and IsError.
Why did this happen? Trace recording and analysis tools see the application after it has been compiled to IL (and optimized along the way) and then JIT-compiled (and optimized again) into machine code. And one of the most common optimizations is inlining: the body of a called method is embedded entirely into the body of the caller, because a method call is a relatively expensive instruction. This means that in the trace, in the call tree, we can see methods and instructions that in the C# code are called at any depth below. So the call to IcuGetHashCode that we see inside ProcessHttpSpan only means that somewhere, at some depth of the call chain inside ProcessHttpSpan, there was an IcuGetHashCode.
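To illustrate with a standalone toy example (not the article's code): with inlining, the middle frame can vanish from profiler stacks, and NoInlining is a handy way to keep a method visible while investigating.
using System;
using System.Runtime.CompilerServices;

class InliningDemo
{
    static void Main() => Console.WriteLine(Outer());

    static int Outer() => Middle() + 1;

    // The JIT may paste this body straight into Outer(), so a profiler can show
    // Inner() directly under Outer() with no Middle() frame in between.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Middle() => Inner() * 2;

    // Forbidding inlining keeps the frame visible in traces, at a small cost.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Inner() => Environment.TickCount & 0xF;
}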
You could go and read ALL the code inside ProcessHttpSpan, every method call, even the system ones, to find the place that calls GetHashCode on strings. But that is the last-resort and rather expensive approach. For now, let's try deduction!
Deductive method
Let's break the line System.Globalization.CompareInfo.IcuGetHashCodeOfString(ReadOnlySpan, CompareOptions) down into keywords:
Hash code of a string – yes, very relevant.
CompareOptions, potentially with IgnoreCase-style options – noted.
The CompareInfo class operates on CultureInfo and lives in the System.Globalization namespace – perhaps an important detail; this is about all those InvariantCulture things.
How often do you take hash codes of strings in code? Well, it happens, but it is still rare. Much more often strings are put into collections, for example a HashSet or a Dictionary, as keys.
So we should be looking for something that frequently takes hash codes of strings, potentially with IgnoreCase options and a culture? Most likely it is a hash set or a dictionary!
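A quick way to see this deduction in code (a standalone sketch; the key value is made up): ordinal comparers hash the raw characters, while culture-aware comparers delegate to CompareInfo from System.Globalization – the same class that shows up in the trace.
using System;
using System.Globalization;

class HashCodePaths
{
    static void Main()
    {
        var key = "Snitch.Enricher";   // any string key would do

        // Ordinal comparers hash the raw UTF-16 code units of the string.
        int ordinal = StringComparer.OrdinalIgnoreCase.GetHashCode(key);

        // Culture-aware comparers delegate to CompareInfo, i.e. the same
        // System.Globalization machinery that showed up in the trace.
        int invariant = StringComparer.InvariantCultureIgnoreCase.GetHashCode(key);
        int viaCompareInfo = CultureInfo.InvariantCulture.CompareInfo
            .GetHashCode(key, CompareOptions.IgnoreCase);

        Console.WriteLine($"{ordinal} {invariant} {viaCompareInfo}");
    }
}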
Call Tree
The Hotspots tab is, in fact, just auxiliary. What else can you extract from a trace? You can pay attention to the lower (and, in my opinion, main) part, called Call Tree. This is how it looks after we have excluded all the functions that do not interest us:
Yes, unfortunately, for a multi-threaded application that actively uses the thread pool, the call tree is very inconvenient. We can end up with as many different trees as there are asynchronous methods in our code, because each “call tree” starts with the thread pool method ThreadStart(), followed by whichever asynchronous method had its state machine suspended by a true asynchronous wait.
But you can still use it: just walk through several chains from the top. Expanding the topmost one, in this case we see this (yes, I clicked “expand” on each opened line until I got to the very bottom):
And yes, here we see confirmation of our hypothesis: ProcessHttpSpan, somewhere deeper on the stack, calls FindValue on a dictionary, which calls GetHashCode, which calls IcuGetHashCodeOfString. But, again, all the intermediate method calls are missing here.
Looking in the code
Let's check whether we use a HashSet or Dictionary with some StringComparer anywhere. What would that look like in code?
var dictionary = new Dictionary<string, string>(StringComparer.xxx);
I searched for usages of StringComparer in our code. That is pretty easy: type StringComparer somewhere and hit the find-usages hotkey.
I won't show a screenshot of the solution-wide search, because it found quite a lot of usages in projects that Snitch.Enricher does not depend on, and some of the usages were in methods that obviously do not affect our ProcessHttpSpan and ProcessCustomSpan. I had to filter out about a dozen usages of StringComparer that are of no interest to us, which was relatively easy thanks to my knowledge of the project structure.
And I was very lucky. There is only one suspicious example left:
public class PerServiceSettings<TValue> : Dictionary<string, TValue>
{
public PerServiceSettings(Dictionary<string, TValue> dictionary)
: base(dictionary, StringComparer.InvariantCultureIgnoreCase)
{
}
}
And this read-only dictionary is accessed once per call of each of the methods highlighted in the trace: ProcessHttpSpan, Normalize and IsError. Each access is simply reading a value by key.
What is wrong with that? Pulling a value from a dictionary by key should not be that expensive (compared to the ton of other work done in these methods). The only thing the eye catches on is InvariantCultureIgnoreCase. Indeed, why do we need InvariantCulture if we just want to fetch settings by service name? We don't need any umlauts or Chinese characters (service names consist only of url-friendly characters). We don't need to run this on users' desktops. We don't need to compare letters in the linguistically correct order. We don't even need to compare the letter ё correctly.
A story about the letter ё
By the way, .NET 6 did not compare such strings correctly until our engineer M helped force Microsoft to fix it.
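If you want to see this with your own eyes, here is a tiny standalone experiment. No particular output is asserted for the culture-aware cases, since those results depend on the .NET version and the underlying ICU/NLS library – which is exactly what the ё story is about:
using System;

class LetterYoDemo
{
    static void Main()
    {
        var a = "ёлка";
        var b = "елка";

        // Ordinal comparisons are false: the code points simply differ.
        Console.WriteLine(string.Equals(a, b, StringComparison.Ordinal));
        Console.WriteLine(string.Equals(a, b, StringComparison.OrdinalIgnoreCase));

        // Culture-aware results: check them on your own runtime and OS.
        Console.WriteLine(string.Equals(a, b, StringComparison.InvariantCulture));
        Console.WriteLine(string.Equals(a, b, StringComparison.InvariantCultureIgnoreCase));
    }
}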
Benchmark
How much worse is InvariantCulture than Ordinal?
Regardless of whether we actually need InvariantCulture comparisons here, let's check what changes if we replace InvariantCultureIgnoreCase with OrdinalIgnoreCase.
TryGetValue
[GlobalSetup]
public void SetUp()
{
dictionary1 = new Dictionary<string, int>();
dictionary2 = new Dictionary<string, int>(comparer: StringComparer.Ordinal);
dictionary3 = new Dictionary<string, int>(comparer: StringComparer.InvariantCulture);
dictionary5 = new Dictionary<string, int>(comparer: StringComparer.OrdinalIgnoreCase);
dictionary6 = new Dictionary<string, int>(comparer: StringComparer.InvariantCultureIgnoreCase);
keys = new List<string>();
for (var i = 0; i < 10_000; i++)
{
var str = Guid.NewGuid().ToString();
// deliberately, to avoid reference-equality comparison
keys.Add(Guid.Parse(str).ToString());
dictionary1[str] = i;
dictionary2[str] = i;
dictionary3[str] = i;
dictionary5[str] = i;
dictionary6[str] = i;
}
}
...
[Benchmark]
public int TryGetOrdinalIgnoreCase()
{
var summ = 0;
for (var i = 0; i < keys.Count; i++)
{
if (dictionary5.TryGetValue(keys[i], out var value))
summ += value;
}
return summ;
}
[Benchmark]
public int TryGetInvariantCultureIgnoreCase()
{
var summ = 0;
for (var i = 0; i < keys.Count; i++)
{
if (dictionary6.TryGetValue(keys[i], out var value))
summ += value;
}
return summ;
}
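For completeness, here is a cut-down, self-contained version of the same dictionary benchmark together with a BenchmarkDotNet entry point, so it can be pasted and run as-is. The class name DictionaryComparerBenchmarks is my placeholder, not something from the original solution:
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DictionaryComparerBenchmarks
{
    private Dictionary<string, int> invariantIgnoreCase;
    private List<string> keys;

    [GlobalSetup]
    public void SetUp()
    {
        invariantIgnoreCase = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
        keys = new List<string>();
        for (var i = 0; i < 10_000; i++)
        {
            var str = Guid.NewGuid().ToString();
            keys.Add(Guid.Parse(str).ToString()); // a distinct instance, no reference-equality shortcut
            invariantIgnoreCase[str] = i;
        }
    }

    [Benchmark]
    public int TryGetInvariantCultureIgnoreCase()
    {
        var summ = 0;
        for (var i = 0; i < keys.Count; i++)
        {
            if (invariantIgnoreCase.TryGetValue(keys[i], out var value))
                summ += value;
        }
        return summ;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<DictionaryComparerBenchmarks>();
}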
No comments are needed for the benchmark results on Windows.
| Method | Mean |
|--------------------------------- |------------:|
| TryGet | 370.5 us |
| TryGetOrdinal | 353.6 us |
| TryGetInvariant | 11,616.0 us |
| TryGetOrdinalIgnoreCase | 589.4 us |
| TryGetInvariantCultureIgnoreCase | 11,547.5 us |
Well, perhaps it is worth running this benchmark not only on Windows but also on Linux – after all, culture handling depends heavily on the OS.
| Method | Mean |
|--------------------------------- |-----------:|
| TryGet | 365.8 us |
| TryGetOrdinal | 364.5 us |
| TryGetInvariant | 9,029.4 us |
| TryGetOrdinalIgnoreCase | 580.1 us |
| TryGetInvariantCultureIgnoreCase | 8,783.4 us |
Surprisingly, Linux is a little faster here; usually it is the other way around. In any case, everything is clear: regardless of the OS, the difference is 15-30x. Feel free to change InvariantCultureIgnoreCase to OrdinalIgnoreCase in our case!
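Applied to the PerServiceSettings class found earlier, the change boils down to a single constructor argument. A sketch of the intended fix, not a quote from the repository:
using System;
using System.Collections.Generic;

public class PerServiceSettings<TValue> : Dictionary<string, TValue>
{
    public PerServiceSettings(Dictionary<string, TValue> dictionary)
        // Service names are plain url-friendly strings, so a fast ordinal
        // comparison is enough; no culture-aware hashing is needed.
        : base(dictionary, StringComparer.OrdinalIgnoreCase)
    {
    }
}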
IndexOf, Equals
Let's search the solution for more usages of InvariantCultureIgnoreCase and InvariantCulture – not only on the StringComparer class, but also on StringComparison (the argument of String.Equals, etc.).
There were two of them:
service.IndexOf(" via ", StringComparison.InvariantCultureIgnoreCase);
...
normalizedPath.Equals(filteredPrefix, StringComparison.InvariantCultureIgnoreCase))
Both of these lines occur within our studied methods. They may not appear at the top of the trace, but since we’re on the subject, let’s examine how much their performance depends on culture.
private const string service = "Kontur.что-то там via Kontur.что-то там другое";
[Benchmark]
public int GetOrdinalIgnoreCase()
{
return service.IndexOf(" via ", StringComparison.OrdinalIgnoreCase);
}
[Benchmark]
public int GetInvariantCultureIgnoreCase()
{
return service.IndexOf(" via ", StringComparison.InvariantCultureIgnoreCase);
}
Here are the results.
| Method | Runtime | Mean |
|------------------------------ |--------- |---------:|
| GetOrdinalIgnoreCase | .NET 6.0 | 65.35 ns |
| GetInvariantCultureIgnoreCase | .NET 6.0 | 72.18 ns |
|------------------------------ |--------- |---------:|
| GetOrdinalIgnoreCase | .NET 7.0 | 16.10 ns |
| GetInvariantCultureIgnoreCase | .NET 7.0 | 62.72 ns |
On .NET 6 this will not help us much right now, but if we were on .NET 7+, the difference would already be pleasant.
Result
Well, okay. We replaced InvariantCulture with Ordinal in a couple of places that, judging by the trace, should be hot. It remains to check how this affected the real application.
The graph clearly shows the moment of release. It seems to be a success.
It is important to ensure that this is not a restart effect and that it persists for a long time.
The graphs speak for themselves.
So one hour of digging and 30 seconds of writing code turned out to be worth ~20 saved cores of daily consumption. A culture for string handling, once chosen carelessly and at random, cost ~25% of the CPU consumption of this particular Snitch.Enricher application.
How to analyze a Timeline trace in dotTrace when optimizing CPU
Filter out what is not interesting to you. Most often these are methods with Wait and other words in their names that hint at waiting.
Ruthlessly exclude things that you do not understand at all, or that do not look like CPU work.
Don't neglect studying the .NET code on GitHub, and the code of your other dependencies too. This often helps.
Be prepared to do a full scan: all usages, all the code, and so on. Yes, you will have to do this often; there is no escaping it.
If hotspots do not help, do not be afraid of the huge variety in the Call Tree window. Just climb into each more or less heavy line and dive down to the very bottom, until the consumption boils down to one specific method or splits into many small ones.
Feel free to make hypotheses and then test them. Remember, the trace does not show all the information, only part of it. Knowing the specifics of the project you are studying will help you a lot.
Verify your hypotheses with benchmarks, in the same environment your application runs in: the same .NET version, the same OS. This is important.
Be sure to analyze the metrics after the release and double-check the results.