A large-scale test of GPUs for machine learning

Hi all! My name is Alexey Rudak and I am the founder of Lingvanex, a company that builds machine translation and speech transcription solutions. For our work we constantly train language models, and our team uses dozens of different GPUs chosen for different tasks: in some places we need a powerful DGX station, while in others an old gaming card like the RTX 2080 Ti is enough. Choosing the optimal GPU configuration saves not only training time but also money.

Interestingly, there are surprisingly few articles on the Internet with GPU benchmarks aimed specifically at language model training speed; mostly you find only inference tests. When the new H100 chip came out, NVidia's report stated that it was up to nine times faster than the A100 in training, but for our tasks the new card was only 90% faster than the old one. For comparison, our cloud providers charged about twice as much for the H100 as for the A100, so switching to the new H100 made no financial sense for us.

In addition, we tested a DGX station consisting of 8 x A100 80GB GPUs, which costs 10 thousand dollars per month. After the test it became clear that the price/performance ratio of this station does not suit us at all: for the same money we could rent 66 x RTX 3090, which in total would be far more useful to us.

Our translation models have up to 500 million parameters (100 to 300 million on average). Perhaps if we significantly increased the number of parameters, the price/performance ratio of the DGX would look better. At the moment we do not train one large language model that can translate between all languages in all directions at once; instead we use a separate model for each language pair, for example English-German. Each of these models is 120 to 300 MB in size.

It is worth noting that different languages have different amounts of data available on the Internet: for Spanish you can find 500 million sentence pairs with translations, while for Tibetan there are no more than a million. Because of this, models for different languages are trained with different numbers of parameters and end up with different translation quality. To create a translation model from English to Spanish we use a server with 4 x RTX 4500 and 256GB RAM, while the Tibetan model can be trained on an RTX 2080 Ti with 16GB RAM, since increasing the complexity of the neural network (and therefore renting a more powerful server) makes no sense with such a small amount of data.

Selecting GPUs and a bit of theory

The language models were trained with the OpenNMT-tf framework. This stage included preparing the data, training the model, and comparing its output with a reference translation. Using FP16 instead of FP32 during training allowed us to significantly reduce the training time of the language models without degrading translation quality, but not all of our GPUs supported it.
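In OpenNMT-tf 2.x mixed precision is exposed as a `--mixed_precision` option of `onmt-main`; under the hood it amounts to enabling TensorFlow's mixed precision policy, roughly as in the minimal sketch below (illustrative only, not our exact training script):

```python
import tensorflow as tf

# Enable mixed precision globally: layers compute in float16 while
# keeping float32 master weights for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# On GPUs with CUDA compute capability >= 7.0 this engages the Tensor
# Cores, which is where most of the training speedup comes from.
print(tf.keras.mixed_precision.global_policy())  # mixed_float16
```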

When choosing GPUs, the usual criteria are compute performance (TFLOPS), video memory (VRAM), support for the required libraries and frameworks, budget, and other factors such as the card's size and form factor, power requirements, cooling, and compatibility with your system. When training text generation models, you should also remember that different languages consume different amounts of resources. For example, in UTF-8 one character of a Latin-script language takes 1 byte, a Cyrillic character takes 2 bytes, and characters of languages that use hieroglyphs take 3 bytes. Understanding the characteristics of your video card significantly affects the speed of the training process.
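As a quick sanity check, here is a small Python snippet (purely illustrative) that confirms these per-character byte counts for UTF-8, the encoding typically used for training corpora:

```python
# Byte counts per character in UTF-8 for three sample scripts.
samples = {"Latin": "a", "Cyrillic": "я", "CJK ideograph": "语"}

for script, ch in samples.items():
    print(f"{script}: {len(ch.encode('utf-8'))} byte(s) per character")

# Expected output:
# Latin: 1 byte(s) per character
# Cyrillic: 2 byte(s) per character
# CJK ideograph: 3 byte(s) per character
```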

From the point of view of the GPUs used for training, the cards can be conditionally divided into two groups by period of use: the early cards, on which the first training speed measurements were made, and the cards that are currently in use. The main characteristics of these GPUs are given in Tables 1 and 2, respectively.

Table 1 – Previously used graphics processors and their technical parameters

| Number of GPUs | Name | VRAM (type, GB) | CUDA compute capability | FP16, TFLOPS | FP32, TFLOPS |
|---|---|---|---|---|---|
| 1 | Tesla V100-SXM2 | HBM2, 16 | 7.0 | 31.33 | 16.31 |
| 2 | Tesla V100-SXM2 | HBM2, 32 | 7.0 | 31.33 | 15.67 |
| 1 | RTX 4060 Ti | GDDR6, 8 | 8.9 | 22.06 | 22.06 |
| 1 | Nvidia A40 | GDDR6, 48 | 8.6 | 37.42 | 37.42 |
| 2 | Nvidia A40 | GDDR6, 96 | 8.6 | 37.42 | 37.42 |
| 1 | Nvidia A100 | HBM2, 40 | 8.0 | 77.97 | 19.49 |
| 1 | Nvidia A100 | HBM2, 80 | 8.0 | 77.97 | 19.49 |
| 1 | Nvidia RTX A6000 | GDDR6, 48 | 8.6 | 38.71 | 38.71 |
| 1 | Nvidia A10 | GDDR6, 24 | 8.6 | 31.24 | 31.24 |
| 8 | Nvidia A10 | GDDR6, 192 | 8.6 | 31.24 | 31.24 |
| 1 | Nvidia H100 | HBM3, 80 | 9.0 | 204.9 | 51.22 |

Notes

1. With a CUDA compute capability of 7.0 or higher, using FP16 increases training speed, with the size of the gain depending on the compute capability and the characteristics of the particular card.

2. If the card's specification states that its FP16-to-FP32 performance ratio is greater than 1:1, then mixed precision will increase training speed by approximately the factor stated in the specification. For example, for the Quadro RTX 6000 the FP16 figure of 32.62 TFLOPS (a 2:1 ratio) speeds up training by at least a factor of two (2.4 times in practice).

Table 2 – Currently used GPU models and their main characteristics

| Number of GPUs | Name | VRAM (type, GB) | CUDA compute capability | FP16, TFLOPS | FP32, TFLOPS |
|---|---|---|---|---|---|
| 1 | Quadro RTX 6000 | GDDR6, 24 | 7.5 | 32.62 | 16.31 |
| 2 | Quadro RTX 6000 | GDDR6, 48 | 7.5 | 32.62 | 16.31 |
| 4 | Quadro RTX 6000 | GDDR6, 96 | 7.5 | 32.62 | 16.31 |
| 2 | Nvidia TITAN RTX | GDDR6, 48 | 7.5 | 32.62 | 16.31 |
| 4 | Nvidia RTX A4500 | GDDR6, 80 | 8.6 | 23.65 | 23.65 |
| 1 | Nvidia GeForce RTX 3090 | GDDR6X, 24 | 8.6 | 35.58 | 35.58 |
| 1 | Nvidia GeForce RTX 3070 | GDDR6, 8 | 8.6 | 20.31 | 20.31 |

* The FP16 and FP32 TFLOPS values are taken from the specifications for a single GPU.

GPU training and testing process

The models were trained on a set of 18 GPUs. A large number of language pairs (more than a hundred languages) were used in the training runs. The training was based on the following neural network parameters (a sketch of a matching OpenNMT-tf model definition follows the list):

  • vocab size = 30,000

  • num units = 768

  • layers = 6

  • heads = 16

  • inner dimension = 4,096
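For illustration, here is a minimal sketch of how these hyperparameters map onto an OpenNMT-tf custom model definition (a Transformer with 6 layers, 16 heads, 768 units and a 4,096 feed-forward dimension). Keyword names may differ slightly between OpenNMT-tf versions, and the dropout values are the usual Transformer defaults rather than figures from this article, so treat it as a sketch rather than our exact configuration:

```python
import opennmt

# Sketch of a custom model definition for OpenNMT-tf (2.x-style API),
# reflecting the hyperparameters listed above. The ~30,000-entry
# vocabulary is configured separately via the data/tokenization config.
class TranslationModel(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.WordEmbedder(embedding_size=768),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=768),
            num_layers=6,        # encoder and decoder layers
            num_units=768,       # model (hidden) dimension
            num_heads=16,        # attention heads
            ffn_inner_dim=4096,  # inner dimension of the feed-forward block
            dropout=0.1,             # assumed default, not from the article
            attention_dropout=0.1,   # assumed default
            ffn_dropout=0.1,         # assumed default
        )


model = TranslationModel()
```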

To begin with, let us look at the GPUs from the first group (Table 1). The basis for comparison is the time, in minutes and seconds, spent training the model for approximately 1,000 steps with an effective batch size of 100,000 tokens.
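The "batch size" column in the tables below is the per-step token count on the GPU(s), while the effective batch of 100,000 tokens is reached through gradient accumulation. The sketch below is our reading of how the numbers combine (the framework handles the exact mechanics), using figures from Table 3:

```python
# Rough illustration: per-step batch size times the number of GPUs,
# with gradients accumulated until ~100k tokens are reached.
EFFECTIVE_BATCH_TOKENS = 100_000

def accumulation_steps(batch_size_tokens: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach the effective batch size."""
    per_update = batch_size_tokens * num_gpus
    return max(1, round(EFFECTIVE_BATCH_TOKENS / per_update))

print(accumulation_steps(25_000, 1))  # 1 x H100     -> 4
print(accumulation_steps(6_250, 8))   # 8 x A10      -> 2
print(accumulation_steps(4_167, 2))   # 2 x V100-SXM -> 12
```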

We emphasize that for the first group, speed measurements were carried out with the alignment mechanism enabled and only in FP32. Without this mechanism, training on some servers can be much faster.

The alignment mechanism matches substrings in the source and translated text. It is needed for translating formatted text, such as web pages, where a substring within a sentence may be highlighted in a different font and must be translated with the same emphasis.

Given the neural network parameters listed above, the best time in this group was shown by the Nvidia H100 with a training time of 22 minutes, the middle of the field by the GeForce RTX 4060 Ti with 72 minutes, and last place went to the Tesla V100-SXM2 with 140 minutes.

We also tested eight Nvidia A10 cards with a training time of 20 minutes 28 seconds, two Nvidia A40 cards with 56 minutes, and two Tesla V100-SXM cards with 86 minutes. Using several cards of the same series simultaneously can speed up training and come close to the times of more powerful GPUs, but this approach is not always financially or operationally efficient. The training speed measurements are shown in Table 3.

Table 3 – Training time measurements on previously used graphics cards (using alignment, effective batch size = 100k, FP32)

| Number of GPUs used | GPU | Approximate time (min.sec) per 1,000 steps | Batch size used |
|---|---|---|---|
| 8 | Nvidia A10 | 20.28 | 6,250 |
| 1 | Nvidia H100 | 22 | 25,000 |
| 1 | A100 (80 GB) | 40 | 25,000 |
| 1 | A100 (40 GB) | 56 | 15,000 |
| 2 | Nvidia A40 | 56 | 12,500 |
| 1 | RTX A6000 | 68.25 | 12,500 |
| 1 | GeForce RTX 4060 Ti | 72 | 4,167 |
| 1 | Nvidia A40 | 82.08 | 12,500 |
| 2 | Tesla V100-SXM | 86 | 4,167 |
| 1 | Nvidia A10 | 104.50 | 5,000 |
| 1 | Tesla V100-SXM2 | 140 | 4,167 |

Next, we will conduct a comparative analysis of the GPUs currently in use (Table 2). For this group, speed measurements were carried out with the alignment mechanism enabled, in both FP32 and FP16. The measurements with FP32 and with mixed precision are presented below in Tables 4 and 5, respectively.

Having measured the speed of the GPUs in this group, we can say that first place went to the Nvidia RTX A4500 with a training time of 31 minutes, but it should be emphasized that this was achieved by increasing the number of GPUs used to four. Without taking this into account, a single such card would train considerably more slowly, which would place it in the penultimate position in the final table.

Second place went to the Quadro RTX 6000 with a training time of 47 minutes. Here too the result was obtained with four GPUs: using only one such card loses roughly 3.2 times in speed, giving about 153 minutes and last place.

Third place went to the TITAN RTX with a result of 75.85; this time was achieved by using two cards, which reduced the model training time.

In terms of training speed per single card, the undisputed leader is the GeForce RTX 3090 with a time of 78 minutes 26 seconds. Increasing the number of these GPUs would speed up training further and clearly outperform all the models mentioned above. The training time measurements are shown in Table 4.

Table 4 – Comparative analysis of language model training speed on the currently used GPUs (using alignment, effective batch size = 100k, FP32)

| Number of GPUs | Name | Approximate time (min.sec) per 1,000 steps | Batch size used |
|---|---|---|---|
| 4 | Nvidia RTX A4500 | 31 | 5,000 |
| 4 | Quadro RTX 6000 | 47 | 6,250 |
| 2 | Nvidia TITAN RTX | 75.85 | 6,250 |
| 1 | GeForce RTX 3090 | 78.26 | 6,250 |
| 2 | Quadro RTX 6000 | 88 | 6,250 |
| 1 | GeForce RTX 3070 | 104.17 | 2,000 |
| 1 | Quadro RTX 6000 | 153 | 6,250 |

The following training speed measurements were performed using FP16. Compared to FP32, half precision reduces the amount of memory consumed during training and speeds up computation on the GPU. The translation quality of models trained with FP16 is comparable to that of FP32.
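As a rough back-of-the-envelope illustration of the memory effect (a simplified sketch: real mixed-precision training keeps FP32 master weights and optimizer state, so the actual savings come mostly from activations and the FP16 copies of weights and gradients):

```python
# Weight memory alone for a model of ~200M parameters, the middle of the
# 100-300M range mentioned earlier (an assumed example size).
params = 200_000_000

fp32_weights_gb = params * 4 / 1024**3   # 4 bytes per parameter
fp16_weights_gb = params * 2 / 1024**3   # 2 bytes per parameter

print(f"FP32 weights: {fp32_weights_gb:.2f} GiB")  # ~0.75 GiB
print(f"FP16 weights: {fp16_weights_gb:.2f} GiB")  # ~0.37 GiB
```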

Comparing these FP16 measurements with the FP32 results in the previous table, we can see that the training time of the neural network was cut almost in half. The relative positions of the GPUs remained almost unchanged: the only shift is that the pair of Quadro RTX 6000 cards moved up from fifth to fourth place, finishing ahead of the GeForce RTX 3090 by 96 seconds. The final figures are shown in Table 5.

Table 5 – Comparative analysis of language model training speed on the currently used GPUs (using alignment, effective batch size = 100k, FP16)

| Number of GPUs | Name | Approximate time (min.sec) per 1,000 steps | Batch size used |
|---|---|---|---|
| 4 | Nvidia RTX A4500 | 15.81 | 10,000 |
| 4 | Quadro RTX 6000 | 20.34 | 12,500 |
| 2 | Nvidia TITAN RTX | 32.68 | 6,250 |
| 2 | Quadro RTX 6000 | 37.93 | 10,000 |
| 1 | GeForce RTX 3090 | 38.89 | 10,000 |
| 1 | GeForce RTX 3070 | 48.51 | 2,500 |
| 1 | Quadro RTX 6000 | 52.56 | 10,000 |

Conclusion

Besides choosing a GPU, it is also worth choosing the cloud provider carefully. The cost of the same server configuration can differ by up to a factor of two between providers, and a price that looks cheap at first glance can come with stability problems, a lack of technical support, or arbitrary charges to your card.

For our business, we use 6 different providers and have not yet decided to transfer everything to one due to various risks.

If you are involved in machine learning, large cloud providers such as Google, AWS, and OVH can give you free credits worth up to 100 thousand USD per year to spend on their services. Their websites have startup support programs where you can apply for such a grant. They are interested in you hosting your servers with them, and the more complex your infrastructure is, the larger the grant they can offer.

Large cloud providers offer only professional GPUs from the A, L, and H series. Smaller providers sometimes offer gaming cards from the RTX 30 and 40 series, which cost half as much for comparable performance. After a series of tests, we chose the Nvidia RTX 3090 as the best card for our tasks in terms of price/performance. A server with one RTX 3090 and 16 GB of RAM costs us about $150 per month. To train on large amounts of data, we put 4 such cards into one server.

If you constantly train models and plan to do so for several years, consider building your own servers on gaming video cards.
