Large testing of video cards for machine learning
Hi all! My name is Alexey Rudak and I am the founder of Lingvanex, a company that builds machine translation and speech transcription solutions. For our work, we constantly train language models, and our team uses dozens of different video cards chosen for different tasks: some jobs need a powerful DGX station, while for others an old gaming card like the RTX 2080 Ti is enough. Choosing the optimal GPU configuration saves not only training time but also money.
Interestingly, there are surprisingly few articles on the Internet with GPU benchmarks specifically for language model training speed; mostly you find only inference tests. When the new H100 chip came out, Nvidia's report stated that it was up to nine times faster than the A100 in training, but for our purposes the new card was only 90% faster than the old one. For comparison: among our cloud providers, the price difference between these GPUs was about twofold, so there was no point in switching to the new H100 to save money.
In addition, we tested a DGX station consisting of 8 x A100 80GB video cards, which costs 10 thousand dollars per month. After the test, it became clear that the price/performance ratio of this station does not suit us at all: for the same money we can take 66 x RTX 3090, which in total will bring much more benefit.
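The arithmetic behind that comparison is straightforward; a quick sketch using the monthly rental figures quoted in this article (the ~$150/month price for a single-RTX 3090 server is the provider price mentioned in the conclusion):

```python
# Rough price/performance check: an 8x A100 DGX station at $10,000/month
# vs. single-RTX 3090 servers at ~$150/month each. Prices are the article's
# rental figures, assumed constant.
dgx_monthly_usd = 10_000
rtx3090_server_monthly_usd = 150

n_servers = dgx_monthly_usd // rtx3090_server_monthly_usd
print(n_servers)  # → 66
```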
Our language models for translation have up to 500 million parameters (typically 100 to 300 million). Perhaps if we significantly increased the parameter count, the price/performance ratio of the DGX would improve. At the moment we do not train large language models that can translate between all languages in all variations at once; instead we use a separate model for each language pair, for example English-German. Each of these models takes from 120 to 300 MB.
It is worth noting that the amount of data available on the Internet differs between languages: for Spanish, for example, you can find 500 million sentences with translations, while for Tibetan there are no more than a million. Because of this, models for different languages are trained with different numbers of parameters and end up with different translation quality. To create a translation model from English to Spanish we use a server with 4 x RTX 4500 and 256GB RAM, while Tibetan can be trained on an RTX 2080 Ti with 16GB RAM, since increasing the complexity of the neural network (and, as a result, taking a more powerful server) makes no sense with a small amount of data.
Selecting GPUs and a bit of theory
The language models were trained using the OpenNMT-tf framework. This stage included preparing the data, training the model, and comparing it with a reference translation. Using FP16 instead of FP32 during training allowed us to significantly reduce training time without degrading translation quality, but not all of our GPUs supported it.
When choosing GPUs, the usual criteria are compute performance (TFLOPS), video memory (VRAM), library and framework support, budget, and other factors (size and form factor of the card, power requirements, cooling, and compatibility with your system). When training text generation models, you should also remember that different languages consume different amounts of resources: encoding one character of a Latin-script language takes 1 byte, Cyrillic takes 2 bytes, and languages written with ideographs take 3 bytes. Understanding in advance what your video card is capable of makes a significant difference to how fast training will go.
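Those per-character byte counts come straight from UTF-8 encoding and are easy to verify:

```python
# UTF-8 width of one character per script family: Latin letters take 1 byte,
# Cyrillic 2 bytes, and CJK ideographs 3 bytes, so equal-length corpora in
# different languages occupy different amounts of memory and disk.
samples = {"Latin": "a", "Cyrillic": "я", "CJK": "语"}
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(widths)  # → {'Latin': 1, 'Cyrillic': 2, 'CJK': 3}
```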
From the point of view of the GPUs used for training, the video cards were conditionally divided into two groups by period of use: the earlier cards, on which the first training speed measurements were carried out, and the cards currently in use. Their main characteristics can be found in Tables 1 and 2, respectively.
Table 1 – Previously used graphics processors and their technical parameters
Number of GPUs | Name | VRAM (type, GB) | CUDA | FP16, TFLOPS | FP32, TFLOPS |
1 | Tesla V100-SXM2 | HBM2, 16 | 7.0 | 31.33 | 16.31 |
2 | Tesla V100-SXM2 | HBM2, 32 | 7.0 | 31.33 | 15.67 |
1 | RTX 4060 Ti | GDDR6, 8 | 8.9 | 22.06 | 22.06 |
1 | Nvidia A40 | GDDR6, 48 | 8.6 | 37.42 | 37.42 |
2 | Nvidia A40 | GDDR6, 96 | 8.6 | 37.42 | 37.42 |
1 | Nvidia A100 | HBM2, 40 | 8.0 | 77.97 | 19.49 |
1 | Nvidia A100 | HBM2, 80 | 8.0 | 77.97 | 19.49 |
1 | Nvidia RTX A6000 | GDDR6, 48 | 8.6 | 38.71 | 38.71 |
1 | Nvidia A10 | GDDR6, 24 | 8.6 | 31.24 | 31.24 |
8 | Nvidia A10 | GDDR6, 192 | 8.6 | 31.24 | 31.24 |
1 | Nvidia H100 | HBM3, 80 | 9.0 | 204.9 | 51.22 |
Notes
1. With CUDA compute capability 7.0 or higher, using FP16 increases training speed; the gain depends on the compute capability and on the characteristics of the video card itself.
2. If the video card's specification states that its FP16 to FP32 performance ratio is greater than 1:1, then using mixed precision is guaranteed to speed up training by at least the factor given in the specification. For example, for the Quadro RTX 6000, the FP16 TFLOPS value of 32.62 (a 2:1 ratio) speeds up training by at least two times (2.4 times in practice).
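Note 2 in numbers, using the Quadro RTX 6000 figures from Table 2 (the spec-sheet ratio is a lower bound on the mixed-precision speedup, not our measured result):

```python
# Spec-sheet FP16:FP32 throughput ratio for the Quadro RTX 6000 (Table 2).
fp16_tflops = 32.62
fp32_tflops = 16.31
ratio = fp16_tflops / fp32_tflops
print(round(ratio, 2))  # → 2.0
```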
Table 2 – Currently used GPU models and their main characteristics
Number of GPUs | Name | VRAM (type, GB) | CUDA | FP16, TFLOPS | FP32, TFLOPS |
1 | Quadro RTX 6000 | GDDR6, 24 | 7.5 | 32.62 | 16.31 |
2 | Quadro RTX 6000 | GDDR6, 48 | 7.5 | 32.62 | 16.31 |
4 | Quadro RTX 6000 | GDDR6, 96 | 7.5 | 32.62 | 16.31 |
2 | Nvidia TITAN RTX | GDDR6, 48 | 7.5 | 32.62 | 16.31 |
4 | Nvidia RTX A4500 | GDDR6, 80 | 8.6 | 23.65 | 23.65 |
1 | Nvidia GeForce RTX 3090 | GDDR6X, 24 | 8.6 | 35.58 | 35.58 |
1 | Nvidia GeForce RTX 3070 | GDDR6, 8 | 8.6 | 20.31 | 20.31 |
* – values for FP16, TFLOPS and FP32, TFLOPS are taken from the specifications for a single GPU
GPU training and testing process
The models were trained using a set of 18 GPUs. Training covered a large number of language pairs (more than a hundred languages). The following neural network parameters were used as a baseline:
vocab size = 30,000
num units = 768
layers = 6
heads = 16
inner dimension = 4,096
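A back-of-envelope parameter count for a Transformer of this shape (assuming a standard encoder-decoder layout, with biases, layer norms, and positional parameters omitted; this is a sketch, not our exact model code) lands in the 100-300 million range mentioned earlier:

```python
# Approximate parameter count for a 6-layer encoder-decoder Transformer
# with the hyperparameters listed above.
d, ffn, layers, vocab = 768, 4096, 6, 30_000

per_enc_layer = 4 * d * d + 2 * d * ffn   # self-attention (Q, K, V, O) + FFN
per_dec_layer = 8 * d * d + 2 * d * ffn   # self- + cross-attention + FFN
embeddings = 2 * vocab * d                # source and target embedding tables

approx_params = layers * (per_enc_layer + per_dec_layer) + embeddings
print(f"{approx_params / 1e6:.0f}M")      # → 164M
```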
To begin with, we characterize the GPUs of the first group, based on Table 1. The basis for comparison is the time, in minutes and seconds, spent training the model for approximately 1,000 steps with an effective batch size of 100,000 tokens.
We emphasize that for the first group, speed measurements were carried out with the alignment mechanism enabled and only with FP32. Without this mechanism, training on some servers can be much faster.
The alignment mechanism matches substrings in the source and translated text. It is needed for translating formatted text, such as web pages, where a substring in a sentence may be highlighted in a different font and must be translated with that emphasis preserved.
Taking the above network parameters into account, the best time in the first group was shown by the Nvidia H100 GPU with a training time of 22 minutes; an intermediate result was shown by the GeForce RTX 4060 Ti at 72 minutes; and the Tesla V100-SXM2 came last at 140 minutes.
GPU testing also included eight Nvidia A10 cards with a training time of 20 minutes 28 seconds, two Nvidia A40 cards at 56 minutes, and two Tesla V100-SXM cards at 86 minutes. Using multiple cards of the same GPU series simultaneously can speed up training and come close to the times of more powerful GPUs, but this approach may not be efficient financially or operationally. The training speed measurements are shown in Table 3.
Table 3 – Training time measurements on previously used graphics cards (using alignment, effective batch-size =100k, fp32)
Number of GPUs used | GPU | Approximate speed (min.sec), 1,000 steps | Used Batch size |
8 | Nvidia A10 | 20.28 | 6,250 |
1 | Nvidia H100 | 22 | 25,000 |
1 | A100 (80 Gb) | 40 | 25,000 |
1 | A100 (40 Gb) | 56 | 15,000 |
2 | Nvidia A40 | 56 | 12,500 |
1 | RTX A6000 | 68.25 | 12,500 |
1 | GeForce RTX 4060 Ti | 72 | 4,167 |
1 | Nvidia A40 | 82.08 | 12,500 |
2 | Tesla V100-SXM | 86 | 4,167 |
1 | Nvidia A10 | 104.50 | 5,000 |
1 | Tesla V100-SXM2 | 140 | 4,167 |
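The per-GPU batch sizes in Table 3 relate to the effective batch of 100,000 tokens through gradient accumulation; the accumulation counts below are inferred from the table, not taken from training logs:

```python
def accum_steps(effective_tokens: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach the effective batch size."""
    return effective_tokens // (per_gpu_batch * num_gpus)

print(accum_steps(100_000, 25_000, 1))  # 1x H100 → 4
print(accum_steps(100_000, 6_250, 8))   # 8x A10  → 2
```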
Next, we conduct a comparative analysis of the graphics accelerators currently in use (Table 2). For this group, speed was measured with the alignment mechanism enabled, using both FP32 and FP16. The FP32 and mixed-precision measurements are presented below in Tables 4 and 5, respectively.
Having measured the speeds of the GPUs in this group, we can say that first place was taken by the RTX A4500 configuration with a training time of 31 minutes. It should be emphasized that this result was achieved by increasing the number of GPUs to four: on a single such card, training would be much slower, placing it in the penultimate position in the final table.
Second place went to the Quadro RTX 6000 with a training time of 47 minutes. Again, this result is explained by the number of GPUs used, which is four: with only one such GPU, training is about 3.2 times slower, roughly 153 minutes, which puts it in last place.
Third place was taken by the TITAN RTX pair with a time of 75.85 (min.sec). This result is due to the use of two cards, which reduced the training time.
The undisputed leader in per-card training speed is the GeForce RTX 3090, with a time of 78 minutes 26 seconds on a single GPU. Increasing the number of these GPUs would speed up training further and clearly outperform all the configurations above. The training time measurements are shown in Table 4.
Table 4 – Comparative analysis of language model training speed on currently used GPUs (using alignment, effective batch size = 100k, FP32)
Number of GPUs | Name | Approximate speed (min.sec), 1,000 steps | Used Batch size |
4 | Nvidia RTX A4500 | 31 | 5,000 |
4 | Quadro RTX 6000 | 47 | 6,250 |
2 | Nvidia TITAN RTX | 75.85 | 6,250 |
1 | GeForce RTX 3090 | 78.26 | 6,250 |
2 | Quadro RTX 6000 | 88 | 6,250 |
1 | GeForce RTX 3070 | 104.17 | 2,000 |
1 | Quadro RTX 6000 | 153 | 6,250 |
The next set of measurements was performed using FP16. Compared to FP32, half precision reduces the amount of memory consumed during training and speeds up GPU computation. The translation quality of language models trained with FP16 is comparable to FP32.
Comparing these FP16 training times with the FP32 results in the previous table, we can say that training time was cut almost in half. The GPU rankings remained almost unchanged: the dual Quadro RTX 6000 configuration rose from fifth place to fourth, finishing ahead of the GeForce RTX 3090 by 96 seconds. The final figures are shown in Table 5.
Table 5 – Comparative analysis of language model training speed on currently used GPUs (using alignment, effective batch size = 100k, FP16)
Number of GPUs | Name | Approximate speed (min.sec), 1,000 steps | Used Batch size |
4 | Nvidia RTX A4500 | 15.81 | 10,000 |
4 | Quadro RTX 6000 | 20.34 | 12,500 |
2 | Nvidia TITAN RTX | 32.68 | 6,250 |
2 | Quadro RTX 6000 | 37.93 | 10,000 |
1 | GeForce RTX 3090 | 38.89 | 10,000 |
1 | GeForce RTX 3070 | 48.51 | 2,500 |
1 | Quadro RTX 6000 | 52.56 | 10,000 |
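To put the FP32-to-FP16 gain in numbers, here is a small converter for the tables' min.sec notation (some entries carry a seconds field above 59; the converter takes the field literally), applied to the 4 x RTX A4500 rows of Tables 4 and 5:

```python
def to_seconds(t: str) -> int:
    """Convert the tables' 'min.sec' notation to seconds ('31' means 31:00)."""
    minutes, _, seconds = t.partition(".")
    return int(minutes) * 60 + int(seconds or 0)

fp32_time = to_seconds("31")      # Table 4, 4x RTX A4500, FP32
fp16_time = to_seconds("15.81")   # Table 5, 4x RTX A4500, FP16
print(round(fp32_time / fp16_time, 2))  # → 1.9
```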
Conclusion
In addition to choosing a GPU, it is also worth choosing the optimal cloud provider. Prices can differ by up to 2 times for the same server configuration, and a price that looks cheap at first glance can come with stability problems, a lack of technical support, or arbitrary charges to your card.
For our business, we use 6 different providers and have not yet decided to transfer everything to one due to various risks.
If you are involved in machine learning, large cloud providers like Google, AWS, and OVH can give you free credits worth up to 100 thousand USD per year to spend on their services. Their websites have startup support programs where you can apply for such a grant. They are interested in you hosting your servers with them, and the more complex your infrastructure, the larger the grant they can offer.
Large cloud providers work only with professional GPUs of the A, L, and H series. Smaller providers sometimes offer RTX 30- and 40-series gaming cards, which cost half as much for the same performance. After a series of tests, we chose the Nvidia RTX 3090 as the best card for our tasks in terms of price/performance. A server with one RTX 3090 and 16GB of RAM costs us about $150 per month. To train on large amounts of data, we put 4 cards into one server.
If you train models constantly and plan to do so for several years, consider building your own servers on gaming video cards.