GPU utilization is not the most representative metric

ML teams often use the metric “GPU utilization” to understand how actively their GPUs are being used. This number is usually obtained by running nvidia-smi in a terminal, and many integrated monitoring tools also track GPU utilization as a key performance metric. But, surprisingly, this metric does not always give an accurate picture of GPU performance. In fact, it is possible to drive a GPU to 100% utilization while doing nothing but reads and writes to memory and zero calculations. This article is not about how we figured this out, but about what we learned along the way.

Our company, Trainy, works on infrastructure for managing GPU clusters, so we have to think about these problems a lot. Last year we worked on scaling out training of a foundation model and improving the efficiency of training a large language model. In the process, we went through all the basic steps mentioned in almost every PyTorch performance tuning guide, namely:

  • Saturated the GPU by tuning the data loader defaults (namely num_workers, batch_size, pin_memory, prefetch_factor, etc.) — see the sketch after this list

  • Made the most of the tensor cores by using mixed precision (fp16, bf16)

  • Used a fused optimizer from apex/deepspeed (e.g. FusedAdam, FusedAdamW, etc.)

  • Used instances/networking specifically designed for training models (H100 SXM, A100 SXM), preferring newer instances where possible (H100 > A100 > V100)
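
As a rough illustration, here is a minimal sketch of the data loader, mixed precision, and fused optimizer settings above. The dataset, model, and the concrete values of batch_size, num_workers, and prefetch_factor are placeholders; the article mentions apex/deepspeed's FusedAdam(W), and PyTorch's built-in fused=True flag for AdamW is used below only as a stand-in.

```python
import torch
from torch.utils.data import DataLoader

# Placeholders: train_dataset and model are assumed to exist; the numbers are illustrative.
loader = DataLoader(
    train_dataset,
    batch_size=64,        # tune until the GPU stays saturated
    num_workers=8,        # enough workers to keep batches flowing
    pin_memory=True,      # faster host-to-device transfers
    prefetch_factor=4,    # batches pre-loaded per worker
)

model = model.cuda()
# Stand-in for apex/deepspeed FusedAdam(W): PyTorch's AdamW also has a fused CUDA step.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    # Mixed precision (bf16) so the matmuls run on tensor cores.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch.cuda(non_blocking=True))
    loss.backward()
    optimizer.step()
```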

With these simple changes, we achieved 100% GPU utilization and significant power consumption, which is great! To see if we could do better, we calculated the Model FLOPS Utilization (MFU) of the training workloads.

As a quick reminder, MFU stands for Model FLOPS Utilization (where FLOPS is floating-point operations per second). It is one of the best metrics for judging GPU performance and was first proposed in Google's PaLM paper. It is “the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of a system operating at peak FLOPs.” Simply put, this metric tells you how many floating-point operations per second your workload actually achieves compared to the maximum your GPU is capable of. The only real downside of MFU is that it can be much harder to calculate than, say, GPU utilization, since it depends on both the parameters you specify and the frameworks you use.
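
For intuition, here is a back-of-the-envelope sketch of that calculation, using the common ~6 * N FLOPs-per-token approximation for decoder-only transformers. All of the numbers are made up for illustration and are not taken from the article.

```python
# Illustrative MFU estimate; every number here is a placeholder.
n_params = 7e9               # model size in parameters
tokens_per_second = 9_000    # observed training throughput
peak_flops = 989e12          # e.g. H100 SXM dense bf16 peak -- check your GPU's datasheet

achieved_flops = 6 * n_params * tokens_per_second   # ~6N FLOPs per token (forward + backward)
mfu = achieved_flops / peak_flops
print(f"MFU = {mfu:.1%}")    # about 38% for these made-up numbers
```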

Unfortunately, our model training only reached ~20% MFU. For reference, most LLM training today reaches roughly 35% – 45% MFU. So we wondered: how can we be using only 20% of the theoretical maximum compute built into our GPU while the GPU itself reports 100% utilization?

To answer this question, let's find out what exactly is being tracked when measuring GPU utilization.

What is GPU utilization, really?

GPU utilization is defined rather vaguely in the Nvidia documentation as “the current level of activity for both the GPU's compute resources and the memory interface.” Which is not exactly illuminating.

Surprisingly, a better definition turned up in Datadog's NVML documentation, which describes the metric as “the percentage of time over the last sample period during which one or more kernels were executing on the GPU.” To understand why this definition is misleading, let's quickly refresh how GPUs work.
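
For reference, the same number that nvidia-smi prints can be pulled programmatically through the NVML Python bindings; this is just a sketch, not something the article itself relies on.

```python
import pynvml  # NVML Python bindings (the nvidia-ml-py package)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # same counters nvidia-smi reports
print(f"GPU utilization: {util.gpu}%, memory activity: {util.memory}%")
pynvml.nvmlShutdown()
```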

A GPU has cores and multiprocessor managers. On Nvidia GPUs these multiprocessor managers are called “streaming multiprocessors” (SMs), while on AMD hardware they are called “compute units” (CUs). The GH100 GPU shown below has 144 of these units.

These streaming multiprocessors are like foremen over groups of workers, in this case CUDA cores. When you launch a CUDA kernel, the work is carried out by CUDA cores on one or more SMs. As shown below, even a single SM on the GH100 chip contains many CUDA cores.

Thus, the GPU utilization metric only tells us whether a kernel is executing at a given moment in time. It does not show whether that kernel is using all available cores, or whether it is parallelizing the workload to the GPU's maximum capability. You can see 100% GPU utilization while, in reality, you are only reading and writing data from memory and performing 0 FLOPS.
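
A toy experiment (ours, not from the original article) makes this concrete: the loop below does nothing but copy memory, performing essentially zero floating-point math, yet nvidia-smi will report the GPU as ~100% utilized while it runs.

```python
import torch

x = torch.empty(1 << 28, device="cuda")   # ~1 GiB of fp32 data
y = torch.empty_like(x)

# Pure memory traffic: a copy kernel is almost always resident, so "GPU utilization"
# reads ~100%, even though no floating-point math is happening.
for _ in range(10_000):
    y.copy_(x)
torch.cuda.synchronize()
```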

Now let's be clear: this metric is mostly misleading to those who do not have a background in systems programming (for example, many ML engineers). As noted here, the definition of GPU utilization in this form does make sense within the framework of the “USE” methodology.

But, returning to the problem formulated in this article, this difference is exactly what explains the observed gap between GPU utilization and MFU! There is definitely unused performance left in the GPU; we just need to dig it out.

A deeper analysis

Looking for spare performance, the next step is to profile the model's training loop. Let's look at the training loop in the PyTorch profiler to get a better picture of the situation.
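
A minimal profiling harness might look like the sketch below; loader, model, and train_step are placeholders for your own data pipeline, model, and training step. The resulting trace is where the per-kernel timing and SM statistics discussed below show up.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# loader, model and train_step are placeholders for your pipeline and training step.
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
)

with prof:
    for step, batch in enumerate(loader):
        train_step(model, batch)   # one forward/backward/optimizer step
        prof.step()                # advance the profiler schedule
        if step >= 5:
            break
```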

As shown below, the softmax kernel shows high GPU utilization but low SM efficiency. That was a serious red flag for us, since a naive softmax implementation is a notorious bottleneck in LLM training. Many kernel fusions, such as FlashAttention, were designed precisely around softmax's memory-bandwidth constraints. Knowing this, we concluded that the SM efficiency statistics might be pointing at a broader inefficiency in the execution of our model.

But what exactly does the SM efficiency parameter show?

SM efficiency (sometimes also called SM activity) is an Nvidia GPU metric that measures what percentage of SMs were active over a given time interval. As mentioned above, SMs can be thought of as foremen over groups of CUDA cores. For example, an Nvidia H100 GPU has 132 SMs, each managing 128 CUDA cores, for a total of 16,896 cores. By measuring SM efficiency, we can determine whether our CUDA kernels are actually using the streaming multiprocessors. If a CUDA kernel runs continuously for 10 seconds but uses only 1 SM, on an H100 this will register as 100% utilization, while the SM efficiency will be 1 / 132 ≈ 0.7%.
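
The arithmetic behind that example is trivial, but worth spelling out:

```python
# The single-SM example above, on an H100 with 132 SMs.
total_sms = 132
active_sms = 1                           # the kernel occupies just one SM for the full 10 s

gpu_utilization = 1.0                    # a kernel was resident the whole time -> 100%
sm_efficiency = active_sms / total_sms   # 1 / 132
print(f"GPU utilization: {gpu_utilization:.0%}, SM efficiency: {sm_efficiency:.2%}")  # 100% vs ~0.76%
```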

Great, that's exactly what we were looking for! You can track SM efficiency layer by layer to identify the low-hanging fruit, that is, the most promising optimization targets.

Performing the optimizations

Now we can easily identify which GPU kernels are barely doing any work and optimize the corresponding layers. Since we are dealing with a transformer stack here, most of the gains come from fusing the layers in the transformer block definition. The following figure summarizes what we optimized.

By fusing, we mean that instead of using PyTorch's native definition of a set of layers, we replace them with a GPU kernel written in CUDA or Triton that combines all of those layers into a single kernel. The speedup comes from the fused kernel spending less time reading from and writing to GPU memory than the individual layers (e.g. softmax) would while performing their math. Flash Attention is an example of such a fused kernel.

Did we write these kernels ourselves? Of course not. Library implementations already exist for most of them. For example, the Flash Attention layers are already implemented as nn.Modules, so you don't have to write a custom kernel with a torch.autograd.Function from scratch. These implementations are also usually already optimized for the hardware, so they not only run faster but use less memory as well.
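
As an illustration of the idea (not the exact flash-attn modules the article refers to), PyTorch's built-in scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel, replacing the explicit matmul + softmax + matmul sequence:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq_len, head_dim); half precision is what the fused backends expect.
q = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Unfused reference: materializes the full (seq_len x seq_len) attention matrix in HBM.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
out_naive = torch.softmax(scores, dim=-1) @ v

# Fused path: a single kernel, no full attention matrix written out to memory.
out_fused = F.scaled_dot_product_attention(q, k, v)

torch.testing.assert_close(out_naive, out_fused, atol=1e-2, rtol=1e-2)
```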

In this case, the hardest part is figuring out where exactly in your code the corresponding layers need to be swapped out. Although torch.compile tries to do this automagically, as of the time the original article was written torch.compile does not play well with newer distributed strategies such as FSDP, and in practice it does not deliver the promised speedups because of graph breaks. Hopefully, torch compilers will eventually do this work for us, but for now we have to add the fused implementations by hand.
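
For completeness, here is roughly what the “automagic” path looks like; model is a placeholder, and whether you actually see a speedup depends on graph breaks, which recent PyTorch versions can surface via the TORCH_LOGS environment variable (check your version's logging options).

```python
import torch

# model is a placeholder for your transformer; torch.compile attempts kernel fusion
# automatically, but graph breaks (e.g. under FSDP) can erase the benefit.
compiled_model = torch.compile(model)

# Running training with TORCH_LOGS="graph_breaks" set in the environment prints
# where the compiler falls back to eager execution.
```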

As a result, we were able to achieve a 4x speedup and 38% MFU for this particular customer, up from a baseline of 20% MFU. Most of the optimizations came from the fused kernels, and from finding the right level of parallelism for the model given both its size and the customer's 3.2 Tbps of InfiniBand bandwidth.

Conclusion

We strongly encourage AI teams to monitor the streaming multiprocessor (SM) efficiency of their GPU clusters in addition to GPU utilization. SM efficiency gives a much more representative picture of how much extra performance can still be squeezed out of the GPUs, while GPU utilization mostly indicates whether the machine is idling. It would be nice to calculate MFU as well, but you are unlikely to be able to monitor it continuously, layer by layer. Conveniently, Nvidia's DCGM (Data Center GPU Manager) provides SM activity by default.

There are also more granular metrics, such as SM occupancy (called “Achieved Occupancy” in the PyTorch profiler), which tells you how much work each SM is doing. Understanding these metrics is not that easy, however, and it is usually simpler to focus on maximizing SM efficiency. If you want to learn more about this topic, we recommend the PyTorch Profiler blog, the DCGM documentation, Nsight's Kernel Profiling Guide, and the Nsight documentation.

Thanks for reading this far. Good luck, and squeeze every last drop of performance out of your GPUs!
