A group of users wanted to implement a simple video game in the terminal, but it turned out that Windows Terminal's rendering performance was completely unsuitable for the task. The problem can be reproduced by repeatedly rendering a rainbow and measuring the frame rate (FPS). The 20-color rainbow shown here renders at 30 FPS on my Surface Book with an Intel i7-6700HQ. However, if you draw the same rainbow with 21 or more colors, the frame rate drops below 10 FPS. The drop is consistent, and the situation does not get any worse even with thousands of distinct colors.
Starting an Investigation with Windows Performance Analyzer
Of course, the culprit was not obvious at first. Is the performance drop caused by using Direct2D or DirectWrite incorrectly? Maybe the virtual terminal (VT) sequence parser has trouble processing colors quickly enough? We usually start any performance-related investigation with Windows Performance Analyzer (WPA). It requires a trace file (.etl), which can be recorded with Windows Performance Recorder (WPR).
Personally, I like WPA’s “Flame by Process” mode the most. In a flame graph, each horizontal bar represents a separate function call. A bar’s width corresponds to the total CPU time spent in that function, including the time spent in all the functions it calls. This makes it easy to spot differences between two flame graphs of the same application, or to find outliers, which stand out as conspicuously wide bars.
To repeat this investigation, you will need Windows Terminal 1.12 as well as the rainbowbench tool. After compiling rainbowbench with CMake and the compiler of your choice, run the command rainbowbench 21 in Windows Terminal for at least 10 seconds. While it runs, Windows Performance Recorder (WPR) should be recording a performance trace. Afterwards, you can open the resulting .etl file in Windows Performance Analyzer (WPA) and run the “Load Symbols” command from the menu bar.
On the left side of the image above we see the CPU usage of the text rendering thread while it constantly redraws the same 20 colors, and on the right side the CPU usage when rendering 21 colors. Thanks to the flame graph, we immediately notice significant behavioral differences inside Direct2D; their culprit is most likely an atlas-related function in Direct2D. In graphics applications, an “atlas” usually refers to a texture atlas, and given that Direct2D renders on the GPU by default, this is most likely the code that manages the texture atlas on the GPU. Fortunately, many tools already exist for conveniently debugging applications running on the GPU.
PIX and RenderDoc – Convenient Debugging of Graphics Performance Issues
PIX is an application similar to the powerful open-source project RenderDoc. Both of these tools are extremely useful for debugging and understanding performance issues like this one.
Although PIX supports packaged applications like Windows Terminal (which PIX calls UWP) and offers many useful metrics, I found it more convenient to produce the visualizations with RenderDoc. In practice, however, the two applications work almost identically, so it is easy to switch between them.
Windows Terminal ships with a modern version of the console host, OpenConsole.exe; it contains many improvements not found in conhost.exe, including alternative rendering engines. OpenConsole.exe can be found and run inside the Windows Terminal application package or in one of the Terminal release archives. You can then create a DWORD value named UseDx under HKEY_CURRENT_USER\Console and set it to 0 for the classic GDI text renderer, 1 for the standard Direct2D renderer, or 2 for the new Direct3D engine that fixes this issue. This trick is useful for RenderDoc, which does not support packaged applications like Windows Terminal.
Simply drag and drop the executable into RenderDoc and select Launch. You can then capture snapshots to analyze and debug later.
Opening a capture shows the rendering commands that Direct2D executed on the GPU (top image). The Texture Viewer initially shows nothing, but it turns out that stepping through the events in the Output tab reveals the state of the renderer at runtime. Moreover, the Input tab contains the texture “D2D Internal: Grayscale Lookup Table”:
The existence of such a lookup table seems strongly related both to the fact that displaying more than 20 colors drastically slows down the application and to the problematic function we found with WPA. What if the table’s size is limited? To confirm this suspicion, it is enough to scroll through all the events: in every frame the table is refilled hundreds of times with new colors, because 21 colors cannot fit into a table that holds only 20:
If we limit the test application to 20 colors, then the contents of the table will remain unchanged:
So it turns out that our terminal is hitting a Direct2D edge case: it is optimized for handling up to 20 colors at a time (as of April 2022). This design decision in Direct2D is no accident: a fixed-size lookup table reduces the computational cost and power consumption of glyph coloring, especially on the older hardware it was written for. Moreover, most applications, websites, and so on do not exceed this limit, and when they do, the text is usually static and does not need to be redrawn 60 times per second. In a terminal application, the opposite is frequently the case.
Solving the Problem with More Aggressive Caching
The solution is trivial: we’ll just build our own, much larger lookup table around Direct2D! Unfortunately, we cannot tell Direct2D to use a cache of our own. In fact, relying on its rendering logic at all is problematic here, since its maximum number of colors will always remain finite. In the end, then, we will have to write our own text renderer.
We would like to thank Joe Wilm of Alacritty for his pioneering work on terminal rendering on modern GPUs, Christian Parpart of Contour for continued support and advice, and Tom Siladya for describing the idea. Special thanks to Casey Muratori for proposing this solution and to Mārtiņš Možeiko for providing an example HLSL shader.
Turning fonts and the glyphs they contain into rasterized images is usually very expensive, so implementing some sort of “glyph cache” is critical for performance. A primitive way to cache a glyph is to render it into a small texture the first time it is encountered; on subsequent occurrences we can simply reference that cached texture. And in the same way that Direct2D uses a lookup-table atlas for coloring, we can use a texture atlas of our own for caching glyphs. Instead of rendering 1000 glyphs into 1000 tiny textures, we allocate one huge texture and subdivide it into a grid of 1000 glyph cells.
Let’s say we have a tiny terminal that is 6 by 2 cells and we just want to render colored text “Hello, World!”. We already know that the first step is to create a texture atlas for the glyphs:
After replacing the characters and their glyphs in the terminal with references into the texture atlas, we are left with just a “metadata buffer” that has the same dimensions as the terminal and carries the color information, while the texture atlas contains only unique, uncolored rasterized glyphs. But wait: can’t we run this system in reverse and reconstruct the original output? That is exactly how our GPU shader works:
By writing a primitive pixel shader, we can copy glyphs from the atlas texture to the display output directly on the GPU. Leaving aside more complex topics such as ClearType blending, coloring a glyph is as simple as multiplying its alpha mask by whatever color we need. And the metadata buffer provides both pieces of information for every grid cell: the index of the glyph to copy and the color to paint it with.
The performance gains of this approach depend heavily on the hardware. In general, however, it is at least on par with the Direct2D-based renderer while avoiding all of the limitations associated with glyph coloring.
We measured performance on the following hardware:
- CPU: AMD Ryzen 9 5950X
- GPU: NVIDIA RTX 3080
- RAM: 64GB 3200MHz CL16
- Display: 3840×2160, 60Hz
We measured CPU and GPU load using the values shown in Task Manager, since that is the first place users look when they run into performance problems. In addition, we measured total GPU power consumption, because it is the best indicator of potential power savings, independent of frequency scaling and the like.
DxEngine is the internal name of the old Direct2D-based renderer, and AtlasEngine is the name of the new one. According to these metrics, the new renderer not only reduces overall CPU and GPU usage, but also makes it largely independent of what is being rendered.
Direct2D implements text rendering with a built-in texture atlas into which rasterized glyphs are cached, and a lookup table for coloring those glyphs. The table is used because it reduces the computational cost of coloring glyphs, but unfortunately requires an upper bound on the number of colors it can store. If you exceed this limit and render very colorful text, then Direct2D is forced to remove some of the colors to make room for new ones, which can lead to excessively long lookup table update times, causing severe performance degradation.
For most applications, this is not a problem because the text is usually quite static or does not exceed the upper bound, but terminal applications often paint the entire background with block characters, animate text at over 60 FPS, etc., so this becomes problematic.
Our new renderer is written with modern hardware in mind and only supports rendering monospaced text in a rectangular grid. This lets us take advantage of modern GPUs, with their fast computation, support for conditionals and branching, and relatively large amounts of memory: we can safely improve performance by caching more data and by coloring glyphs without lookup tables, even though this costs some extra computation. And by supporting only rectangular grids of monospaced text, we were able to greatly simplify the implementation and win that overhead back; the result equals the old Direct2D-based renderer in performance and efficiency, and even surpasses it.
You can see the original implementation in pull request #11623. The pull request is quite complex, but the most important parts can be found in the renderer/atlas subfolder. The “parser” (the CPU-side part of the engine) is in AtlasEngine::_flushBufferLine, and the pixel shader (the GPU-side part of the engine) is in the accompanying shader file.
Many improvements have been added since the original pull request. The current state of the engine at the time of writing can be found here. It includes an implementation of Direct2D and DirectWrite’s gamma-corrected text blending algorithm, contained in the three dwrite files, as well as an implementation of ClearType blending as a GPU shader. A standalone demonstration of the latter is available in the dwrite-hlsl demo project.