a waste of resources or an opportunity for “non-Intel” servers?

Professional GPUs in servers are positioned as devices for high-performance computing, artificial intelligence systems and rendering farms for 3D graphics. Should they also be used for video encoding, or is that shooting sparrows with a cannon? Let’s try to figure it out.

The power of modern CPUs, together with solutions like Intel Quick Sync, is enough to work with multi-stream video. Moreover, some experts believe that loading professional GPUs with decoding and encoding is a waste of resources. On consumer video cards the number of simultaneous streams is deliberately limited to two or three, although we have already seen that a little shamanism with the driver lets you bypass this limitation. In the previous article we tested consumer video cards, and now we will deal with a more serious one: the NVIDIA RTX A4000.

Preparing for testing

What if the output of lscpu gives you something like AMD Ryzen 9 5950X 16-Core Processor, but you have an NVIDIA RTX A4000 with 16 GB of video memory installed in your computer, and you want to transcode and record streams from several network cameras? Video from them usually arrives via http, rtp or rtsp, and our task is to catch these streams, transcode them into the required format and write each one to a separate file.

To test this, we at HOSTKEY built a small testbed with the above CPU/GPU configuration and 32 GB of RAM, without any special optimization. On it, we use FFmpeg to receive multicast broadcasts over http and rtsp (the source is the video file bbb_sunflower_1080p_30fps_normal.mp4 from the Blender demo repository), transcode them in a varying number of parallel FFmpeg instances and write each one to a separate file. As the file name suggests, we are streaming 1080p at 30 frames per second. Encoding is applied to the video only; the audio track is left out of the output files (the -an option in the scripts below).

It also does not matter whether we take one incoming stream and process it in several parallel instances or process several different streams in parallel. Networking and background processes on the test bench take up less than 1% of CPU resources, so we can assume that encoding accounts for the main load on the processor and disk subsystem.

From here on we describe only the broadcast over http, since the results for the rtsp stream turned out to be comparable. To avoid opening a lot of terminal consoles on the server, we wrote simple bash scripts for the test. Each one takes the required number of FFmpeg instances as an argument at startup and transcodes the video stream into h264.

Encoding on a bare CPU:

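#!/bin/bash
# Usage: pass the number of parallel FFmpeg instances as the first argument.
for (( i = 0; i < $1; i++ )); do
  # Software H.264 encoding on the CPU; -an leaves audio out of the output.
  ffmpeg -i http://XXX.XXX.XXX.XXX:5454/ -an -vcodec h264 -y "Output-File-$i.mp4" &
done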

On the GPU, we will use the capabilities of the video card through NVENC (we described how to build FFmpeg with NVENC support in the first article of this series):

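#!/bin/bash
# Usage: pass the number of parallel FFmpeg instances as the first argument.
for (( i = 0; i < $1; i++ )); do
  # Hardware H.264 encoding on the GPU via NVENC; -an leaves audio out of the output.
  ffmpeg -i http://XXX.XXX.XXX.XXX:5454/ -an -vcodec h264_nvenc -y "Output-File-$i.mp4" &
done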
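The two scripts differ only in the encoder name, so they can be folded into one helper. The following is a minimal sketch, not the script used for the measurements: the function name launch_streams, the DRY_RUN switch and the ffmpeg-N.log file names are assumptions made for illustration. With DRY_RUN=1 it only prints the commands, which is a convenient way to check them before putting load on the server.

```shell
#!/bin/bash
# Hypothetical helper: start N FFmpeg instances with a chosen encoder.
# Arguments: <count> <encoder> <url>
# DRY_RUN=1 prints the commands instead of running them.
launch_streams() {
  local count=$1 encoder=$2 url=$3
  local i
  for (( i = 0; i < count; i++ )); do
    # -nostdin keeps background instances from reading the terminal.
    local cmd=(ffmpeg -nostdin -i "$url" -an -vcodec "$encoder" -y "Output-File-$i.mp4")
    if [[ ${DRY_RUN:-0} == 1 ]]; then
      printf '%s\n' "${cmd[*]}"
    else
      # FFmpeg reports progress on stderr; keep one log per instance
      # (the log file name is an assumption).
      "${cmd[@]}" 2> "ffmpeg-$i.log" &
    fi
  done
  # In a real run, block until every instance has exited.
  [[ ${DRY_RUN:-0} == 1 ]] || wait
}

# Preview the commands for three NVENC streams without starting them.
DRY_RUN=1 launch_streams 3 h264_nvenc "http://XXX.XXX.XXX.XXX:5454/"
```

Writing each instance's stderr to its own file keeps the progress lines of parallel instances from interleaving in one terminal.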

The broadcast runs in a loop, and the scripts catch our rabbit in the net. First check with vlc or ffplay that the stream is actually being broadcast. We will evaluate the result by CPU/GPU load, memory utilization and the quality of the recorded video, where two parameters matter most: fps (it must be stable and not fall below 30 frames per second) and speed (it shows whether we manage to process the video on the fly). For real-time processing, speed must not fall below 1.00x.

A drop in either of these parameters leads to dropped frames, artifacts, encoding problems and other image damage that you would not want to see on CCTV footage.
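The fps and speed check can be automated by parsing FFmpeg's progress lines. Below is a minimal sketch; the function names progress_field and is_realtime are hypothetical, and the default thresholds (30 fps, 1.0x) are taken from the criteria above, treated as "at least" values.

```shell
#!/bin/bash
# Print the numeric value of key=... from an FFmpeg progress line.
progress_field() {
  local line=$1 key=$2
  sed -n "s/.*[[:space:]]${key}=[[:space:]]*\([0-9.]*\).*/\1/p" <<< "$line"
}

# Succeed if the line meets both thresholds.
# Arguments: <line> [min_fps] [min_speed]
is_realtime() {
  local line=$1 min_fps=${2:-30} min_speed=${3:-1.0}
  local fps speed
  fps=$(progress_field "$line" fps)
  speed=$(progress_field "$line" speed)
  # Status 2: the line had no usable fps or speed value.
  [[ -n $fps && -n $speed ]] || return 2
  # awk handles the floating-point comparison.
  awk -v f="$fps" -v s="$speed" -v mf="$min_fps" -v ms="$min_speed" \
    'BEGIN { exit !(f + 0 >= mf + 0 && s + 0 >= ms + 0) }'
}

# Two progress lines from the CPU test in this article (five and six streams).
for line in \
  "frame=491 fps=37 q=29.0 size=4864kB time=00:00:13.66 bitrate=2915.6kbits/s speed=1.03x" \
  "frame=140 fps=23 q=29.0 size=1024kB time=00:00:01.96 bitrate=2954.4kbits/s speed=0.446x"
do
  if is_realtime "$line"; then echo "OK"; else echo "BEHIND"; fi
done
```

The first line passes and the second does not, which matches the manual reading of the same output.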

Checking encoding on a bare CPU

Running one copy of FFmpeg gives us the initial picture: CPU usage averages 18-20% per core, and the FFmpeg output shows the following:

frame=196 fps=87 q=-1.0 Lsize=2685kB time=00:00:06.43 bitrate=3419.0kbits/s speed=2.84x

There is headroom, so we can try three streams at once:

frame=310 fps=54 q=29.0 size=4608kB time=00:00:07.63 bitrate=4945.3kbits/s speed=1.33x

Four streams take almost all the power of the CPU and “eat up” 13 GB of RAM.

Even so, the FFmpeg output shows that the reserves are not exhausted:

frame=332 fps=49 q=29.0 size=3072kB time=00:00:08.36 bitrate=3007.9kbits/s speed=1.23x

We increase the number of streams to five. The processor is running at its limit, and in places the frame rate and bitrate drop by 5–10%:

frame=491 fps=37 q=29.0 size=4864kB time=00:00:13.66 bitrate=2915.6kbits/s speed=1.03x

Running six streams shows that the limit has been reached. We fall further and further behind real time and start dropping frames:

frame=140 fps=23 q=29.0 size=1024kB time=00:00:01.96 bitrate=2954.4kbits/s speed=0.446x

Turn on GPU power

We run one FFmpeg stream with encoding via h264_nvenc and check the nvidia-smi output to make sure that the video card is actually in use.

Since the full output is rather bulky, we will monitor the GPU parameters with the following command:

nvidia-smi dmon -s pucm

Let’s decipher the notation:

  • pwr – the power consumed by the video card in watts;

  • gtemp – temperature of the video core, mtemp – temperature of the video memory (in degrees Celsius);

  • sm – streaming multiprocessors, mem – memory, enc – encoder, dec – decoder (the utilization of each is shown as a percentage);

  • mclk – current memory clock (in MHz), pclk – current GPU core clock (in MHz);

  • fb – frame buffer usage, bar1 – BAR1 memory usage (in MB).

gpu  pwr  gtemp  mtemp  sm   mem  enc  dec  mclk  pclk  fb    bar1
idx  W    C      C      %    %    %    %    MHz   MHz   MB    MB
0    35   48     -      1    0    6    0    6500  1560  213   5

In this output, we are interested in the GPU encoder load and video memory utilization.
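To watch only these two values, the dmon output can be filtered by column name. This is a sketch under assumptions: the function name dmon_enc_fb is hypothetical, and the code expects a header line that starts with "# gpu" or "gpu" and lists the enc and fb columns. Looking the columns up by name rather than by position keeps it working if a driver version prints a different set of columns.

```shell
#!/bin/bash
# Hypothetical filter: read `nvidia-smi dmon -s pucm` output on stdin and
# print only the encoder load and frame buffer usage for each sample.
dmon_enc_fb() {
  awk '
    # Header line: remember the position of every column name.
    # A leading "#" shifts the names by one field relative to the data rows.
    $1 == "#" && $2 == "gpu" || $1 == "gpu" {
      shift_by = ($1 == "#") ? 1 : 0
      for (i = 1; i <= NF; i++) col[$i] = i - shift_by
      next
    }
    # Skip the units line and any other comment lines.
    $1 == "#" || $1 ~ /^[Ii]dx$/ { next }
    # Data row: print the two columns of interest.
    ("enc" in col) && ("fb" in col) {
      printf "enc=%s%% fb=%sMB\n", $(col["enc"]), $(col["fb"])
    }
  '
}

# Offline example with one sample row; the "-" for mtemp is an assumption
# for a sensor the card does not report.
dmon_enc_fb <<'EOF'
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk    fb  bar1
# Idx     W     C     C     %     %     %     %   MHz   MHz    MB    MB
    0    35    48     -     1     0     6     0  6500  1560   213     5
EOF
```

On a machine with an NVIDIA GPU, the same filter can be fed live data, for example with nvidia-smi dmon -s pucm -d 1 -c 30 | dmon_enc_fb, where -d sets the sampling interval in seconds and -c the number of samples.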

The FFmpeg output gives the following results:

frame=192 fps=96 q=23.0 Lsize=1575kB time=00:00:06.36 bitrate=2027.1kbits/s speed=3.17x

We launch five streams at once. As can be seen from the htop output, in the case of GPU encoding, the CPU load is minimal, and most of the work falls on the video card. The disk subsystem is also loaded much less.

gpu  pwr  gtemp  mtemp  sm   mem  enc  dec  mclk  pclk  fb    bar1
idx  W    C      C      %    %    %    %    MHz   MHz   MB    MB
0    36   48     -      8    2    40   0    6500  1560  1035  14

The load on the encoding blocks rose to 40% and we took up almost a gigabyte of video memory, but the video card is still far from heavily loaded. The FFmpeg output confirms this, showing that we have the resources to at least double the number of streams:

frame=239 fps=67 q=36.0 Lsize=2063kB time=00:00:07.93 bitrate=2130.3kbits/s speed=2.22x

We move up to ten streams. CPU utilization is at the level of 15–20%.

Graphics card readings:

gpu  pwr  gtemp  mtemp  sm   mem  enc  dec  mclk  pclk  fb    bar1
idx  W    C      C      %    %    %    %    MHz   MHz   MB    MB
0    55   48     -      14   4    61   0    6500  1920  2064  24

Power consumption has increased and the video card has raised its core clock, but the encoder capacity and video memory still leave room to increase the load. Let’s check the FFmpeg output to make sure:

frame=1401 fps=36 q=29.0 Lsize=12085kB time=00:00:46.66 bitrate=2121.5kbits/s speed=1.2x

We add four more streams, for a total of 14, and the load on the encoding blocks reaches 100%.

gpu  pwr  gtemp  mtemp  sm   mem  enc  dec  mclk  pclk  fb    bar1
idx  W    C      C      %    %    %    %    MHz   MHz   MB    MB
0    68   59     -      18   7    100  0    6500  1920  2886  33

The FFmpeg output confirms that we have reached the limit. CPU utilization still does not exceed 20%.

frame=668 fps=31 q=26.0 Lsize=5968kB time=00:00:22.23 bitrate=2199.0kbits/s speed=1.04x

The test with 15 streams shows that the GPU is starting to fall behind: the encoders are overloaded, and temperature and power consumption keep rising.

gpu  pwr  gtemp  mtemp  sm   mem  enc  dec  mclk  pclk  fb    bar1
idx  W    C      C      %    %    %    %    MHz   MHz   MB    MB
0    70   63     -      18   7    100  0    6500  1920  3092  35

FFmpeg also confirms that the graphics card is struggling. The frame rate and processing speed are no longer encouraging:

frame=310 fps=28 q=29.0 size=2560kB time=00:00:10.23 bitrate=2049.4kbits/s speed=0.939x

CPU vs GPU

Let’s summarize: the use of a GPU in such a configuration can be called justified, since the maximum number of streams the video card handles in real time (14) is about three times higher than what a far from weak processor manages (5), especially one without support for hardware encoding technologies. On the other hand, we use only a small part of the capabilities of the video adapter. Since the rest of its blocks and its video memory are not heavily loaded, the resources of an expensive device are utilized inefficiently.

Sophisticated readers may notice that we did not test 2K/4K modes, did not use the capabilities of modern codecs (like h265 and VP8/9), and installed a video adapter based on the previous generation architecture in the test bench. An A5000, for example, should show a better result; we will check that in the next article, and then we will dissect Intel Quick Sync.
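The figures behind this summary can be recomputed from the measurements above. This is a rough sketch that uses only numbers reported in this article; the function names stream_ratio and fb_per_stream are hypothetical.

```shell
#!/bin/bash
# Ratio of real-time stream counts, to one decimal place.
stream_ratio() {
  awk -v g="$1" -v c="$2" 'BEGIN { printf "%.1f\n", g / c }'
}

# Frame buffer usage per stream in whole MB.
# Arguments: <streams> <fb_mb>
fb_per_stream() {
  echo $(( $2 / $1 ))
}

# 14 NVENC streams versus 5 CPU streams still at real-time speed.
echo "GPU/CPU stream ratio: $(stream_ratio 14 5)"

# Stream count and fb reading (MB) from each nvidia-smi dmon table.
while read -r streams fb_mb; do
  echo "$streams streams: $(fb_per_stream "$streams" "$fb_mb") MB per stream"
done <<'EOF'
1 213
5 1035
10 2064
14 2886
15 3092
EOF
```

Each stream takes a little over 200 MB of video memory, so at the 14-stream limit under 3 GB of the card's 16 GB is in use. The encoder blocks saturate first, not the memory.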

Let us know in the comments which other nuances should be taken into account when testing, what we missed and what else you would like to know about this topic.
