GPU acceleration in FFmpeg. Do you see the speedup? Neither do I. And yet it should be there…

Preparing for the test

First, let's see what we'll be testing on.
Let's start with my personal machine, which runs EndeavourOS (Arch-based):

Kernel — 6.9.3-arch1-1, 8 GB of DDR4 at 3200 MHz plus 8 GB of physical swap and another 8 GB via zram. CPU — Ryzen 5600H, 6 cores / 12 threads, with a 35 W thermal package (although, according to nvtop, it reaches 45 W). GPU — RTX 3050 Ti with 4 GB of VRAM and a 60 W thermal package.

Nvidia driver version is 550.78-7, FFmpeg is 2:6.1.1-7.

Test server — AlmaLinux 9.3, kernel 6.6.11-1.el9.x86_64, 16 GB DDR3 and 2 GB swap.
CPU — Xeon 2630, 6 cores with SMT disabled, GPU — Quadro P400.

Nvidia driver version is 550.54.15; FFmpeg is a custom build with CUDA optimizations.
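For reference, Nvidia's FFmpeg transcoding guide describes building such a CUDA-enabled FFmpeg roughly as follows. This is a sketch based on that guide, assuming the CUDA toolkit lives under /usr/local/cuda and nv-codec-headers are already installed:

```bash
# Hypothetical build recipe per Nvidia's transcoding guide; adjust paths to taste.
git clone https://git.ffmpeg.org/ffmpeg.git && cd ffmpeg
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp \
            --extra-cflags=-I/usr/local/cuda/include \
            --extra-ldflags=-L/usr/local/cuda/lib64
make -j"$(nproc)" && sudo make install
```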

I should also note that the Xeon is an old-timer from 2012 and the Quadro dates from 2017; but despite their age, the results will be quite ambiguous.

Now let's download a test video. Let's say it's the brainchild of the Blender Foundation – Big Buck Bunny.

```bash
wget https://download.blender.org/demo/movies/BBB/bbb_sunflower_2160p_30fps_normal.mp4.zip
unzip bbb_sunflower_2160p_30fps_normal.mp4.zip
```

Testing

First, let's run a comparative test on my laptop without GPU acceleration.
We will test rescaling from 2160p to 1080p.

```bash
ffmpeg -i bbb_sunflower_2160p_30fps_normal.mp4 -vf scale=1920:1080 -c:v mpeg4 -preset medium output_no_acc.mp4
```
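Rather than retyping the command, a small loop can collect FFmpeg's reported speed from each run – a sketch, assuming the last progress line printed to stderr contains `speed=`:

```bash
# Run the transcode 5 times and print the final reported speed of each run.
for i in {1..5}; do
  printf 'run %d: ' "$i"
  ffmpeg -y -i bbb_sunflower_2160p_30fps_normal.mp4 \
    -vf scale=1920:1080 -c:v mpeg4 -preset medium output_no_acc.mp4 2>&1 \
    | grep -o 'speed=[ 0-9.]*x' | tail -n 1
done
```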

And we repeat this 5 times for reliability, getting the following results:

| Test number | Speed |
| --- | --- |
| 1 | 4.09x |
| 2 | 3.95x |
| 3 | 4.22x |
| 4 | 4.17x |
| 5 | 4.26x |

Average result: 4.138x.

However, the test is not entirely fair: judging by nvtop, the iGPU's power consumption and graphics-core load rise sharply during the run, while video memory consumption barely changes. The Xeon in the test server has no integrated graphics at all.
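A convenient way to see which engines are actually busy is to watch per-engine utilization while the transcode runs; for the Nvidia card, `nvidia-smi dmon` breaks the load down into compute (sm), encoder (enc) and decoder (dec) columns:

```bash
# In a second terminal: sample GPU engine utilization once per second.
nvidia-smi dmon -s u
```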

Now on the test server:

| Test number | Speed |
| --- | --- |
| 1 | 3.54x |
| 2 | 3.52x |
| 3 | 3.52x |
| 4 | 3.55x |
| 5 | 3.54x |

The average result is 3.534x.

The modern Ryzen 5600H turned out to be faster than the old Xeon 2630 by a mere 14.59% – frankly, not a very impressive result, given the nearly 10-year gap between the processors (the Ryzen came out in 2021). On top of that, SMT was enabled on its side, although something tells me FFmpeg ran this operation single-threaded, and its integrated graphics pitched in as well. The server workhorse and dream of the AliExpress used-parts PC builder has no integrated graphics at all, and its Hyper-Threading was disabled to boot.

However, let's leave the oddities and questionable progress in processor performance for another post. Let's move on to what's most interesting – GPU acceleration.

CUDA

Let's start again with my laptop and repeat the tests 5 times.

```bash
ffmpeg -hwaccel cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080,format=yuv420p" -c:v mpeg4 -preset medium cuda.mp4
``` 

| Test number | Speed |
| --- | --- |
| 1 | 4.75x |

As you can see, something doesn't add up here. GPU acceleration should, by definition, be noticeably faster at processing video, yet the result is only marginally better than the CPU with its integrated graphics. How so?
Meanwhile, nvtop monitoring showed the discrete GPU was clearly under load.

Okay, something went wrong once, but there are still 4 tests ahead. Just in case, we'll additionally pin the discrete graphics with the prime-run tool.

```bash
prime-run ffmpeg -hwaccel cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080,format=yuv420p" -c:v mpeg4 -preset medium cuda.mp4
```
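For context: on Arch-based systems prime-run is just a tiny wrapper that sets render-offload environment variables for OpenGL/Vulkan, roughly as below. CUDA does not consult these variables – it enumerates devices on its own – which, I assume, is exactly why prime-run changes nothing for a pure CUDA workload:

```bash
# Approximate body of /usr/bin/prime-run on Arch-based distros:
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia \
__VK_LAYER_NV_optimus=NVIDIA_only "$@"
# For CUDA, device selection would instead go through e.g. CUDA_VISIBLE_DEVICES=0.
```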

And we get… 4.77?

| Test number | Speed |
| --- | --- |
| 2 | 4.77x |


Okay, let's assume prime-run is not enough, although nvtop again showed the GPU under load. Time for the heavy artillery: with EnvyControl we'll force the system to use only the discrete graphics. While we're at it, we'll rerun the CPU test, which, I hope, the iGPU won't be able to help cheat against the old Xeon.

```bash
sudo envycontrol -s nvidia --force-comp --coolbits 24
reboot
```

Great: now we have a black screen in both the Wayland session and X11.
First FFmpeg springs surprises, and now EnvyControl, which had worked reliably before this experiment.

Et tu, Brute?

Okay, let's drop to a TTY with Ctrl+Alt+F3, enter the login and password, and roll everything back via EnvyControl:

```bash
sudo envycontrol --reset
```

And, thank God, the KDE Plasma Wayland session is up and running again – hurray!
But I still haven't made any progress on the test.
I'll try switching to the discrete GPU alone once more, this time also pointing EnvyControl at the SDDM display manager:

```bash
sudo envycontrol -s nvidia --dm sddm 
reboot 
```

Wayland again greeted me with a black screen at login; X11 let me in without problems, but nvtop shows the iGPU still working alongside the discrete card.
Okay, EnvyControl is not inspiring confidence at the moment. Let's resort to its scary older cousin – optimus-manager.

It works only with X, unlike EnvyControl, which previously worked fine under Wayland. Let's try:

```bash
optimus-manager --switch nvidia
```

And we get an error telling us to read the logs, along with a bunch of other errors there, which did not make things any clearer:

```bash
ERROR: the latest GPU setup attempt failed at Xorg pre-start hook.
Log at /var/log/optimus-manager/switch/switch-20240603T140533.log
Cannot execute command because of previous errors.
```

Okay, looks like I'll have to continue the tests without switching to Nvidia-only mode.

We continue to torture FFmpeg, and me, in an attempt to understand why everything is the way it is and not the way it seemed it should be. So be it: I will keep using prime-run, even though it appears to change nothing.

```bash
prime-run ffmpeg -hwaccel cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080,format=yuv420p" -c:v mpeg4 -preset medium cuda.mp4
```

| Test number | Speed |
| --- | --- |
| 3 | 4.78x |
| 4 | 4.79x |
| 5 | 4.81x |

Total average: 4.78x.

Well, frankly, I expected different results. To my mind, everything should have sped up at least twofold – the way it does in, say, DaVinci Resolve, Kdenlive and other video editors, where FFmpeg is most likely buried somewhere under the hood in the backend.

In the end, the difference is only 15.5% relative to the Ryzen 5600H.
Well, okay, let's go back to the remote server and run the test there:

```bash
ffmpeg -hwaccel cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080,format=yuv420p" -c:v mpeg4 -preset medium cuda.mp4
```

The start is already promising: no surprises this time, and at least some pattern has emerged. The PCI-slot plug with 2 GB of VRAM proved slower than both CPUs and its more recent RTX-line sibling, though the gap is not as large as you'd expect:
the Quadro P400 has an FP32 performance of 0.64 TFLOPS, while the mobile RTX 3050 Ti, depending on its thermal package, should deliver around 8.7 TFLOPS.

| Test number | Speed |
| --- | --- |
| 1 | 2.99x |
| 2 | 2.99x |
| 3 | 2.99x |
| 4 | 2.99x |
| 5 | 2.99x |

Stable, even suspiciously stable – 2.99x.

So with a 13.59x difference in raw FP32 TFLOPS (8.7 vs 0.64), in FFmpeg we see a difference of only 1.59x (4.78x vs 2.99x).

Differences in FFmpeg performance on CPU and GPU

Now let's move from practice to theory and look at how all this is supposed to work.
It feels odd to restate the textbook promise that GPUs should be noticeably faster: using two different systems, we've just seen that, at least in their case, it isn't so.
And judging by how rarely anyone brings up GPU acceleration in FFmpeg, and by the equally unlucky people you find on Google when it does come up – their GPUs barely faster, or even slower – we are not alone.

The workflow on the CPU

When transcoding on the CPU, FFmpeg does the following:

  1. Splitting a container into separate streams (audio, video).

  2. Decoding streams into their raw formats.

  3. Applying filters to these streams (e.g. scaling to 720p to reduce file size).

  4. Encoding streams into specified formats.

  5. Multiplexing streams back into one file.

The workflow on the GPU

NVIDIA Video Codec SDK

When we enable hardware acceleration, some of these steps can be performed on the GPU:

  1. Stream decoding occurs on NVDEC (Nvidia Video Decoder).

  2. Filtering (e.g. scaling) can be performed on the GPU using CUDA.

  3. Stream encoding is performed on NVENC (Nvidia Video Encoder).

  4. However, audio streams are still processed on the CPU, since NVENC/NVDEC are intended only for video.

  5. After decoding, raw video frames are sent to VRAM for accelerated filtering.

  6. After filtering, the frames are encoded and returned to the system's main memory for multiplexing and completion of the process.
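Mapped onto a command line, a pipeline where decoding, scaling and encoding all stay on the GPU looks roughly like this (a sketch of the idea – we will arrive at almost exactly this form later in the post):

```bash
# Decode on NVDEC, keep frames in VRAM (-hwaccel_output_format cuda),
# scale with the CUDA filter, encode on NVENC; audio is simply copied.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -vf scale_cuda=1920:1080 -c:v h264_nvenc -c:a copy output.mp4
```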

Creeping doubts

There is clearly a bug somewhere in my tests…
Maybe something is wrong with the ill-fated video from the Blender Foundation; let's check which codec it uses.

```bash
ffprobe -v error -select_streams v:0 -show_entries stream=codec_name -of default=noprint_wrappers=1:nokey=1 bbb_sunflower_2160p_30fps_normal.mp4
```

It uses h264. So, h264 – well, sure, that's standard for an mpeg4 container… wait.
Our container is .mp4, which indeed stands for MPEG-4 (Part 14), but the codecs inside it are h264/h265/h266, or software libx264 and friends.
So what did I specify in my commands?
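ffprobe can show the container and the stream codec side by side, which makes the distinction obvious (the expected output below is an assumption based on the stock mov/mp4 demuxer):

```bash
# Container (format) and video-stream codec are reported separately:
ffprobe -v error -select_streams v:0 \
  -show_entries format=format_name:stream=codec_name \
  -of default=noprint_wrappers=1 bbb_sunflower_2160p_30fps_normal.mp4
# codec_name=h264
# format_name=mov,mp4,m4a,3gp,3g2,mj2
```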

*the realization dawns that I've walked right into the Joker's trap*

Damn, I had been putting mpeg4 everywhere out of habit, having grown used to treating it as a synonym for the .mp4 video container in general – whereas in FFmpeg, -c:v mpeg4 selects the ancient MPEG-4 Part 2 video encoder.

Granted, as of Nvidia's 2023 documentation mpeg4 is still listed as hardware-accelerated, unlike in the diagram I included above in "The workflow on the GPU" section.
Still, the sluggish handling of old man mpeg4 is probably explained by the fact that, dating back to 1999, it compresses data less effectively – and it has likely been neglected by Nvidia as well, since more effective software and hardware codecs, its descendants, have long since replaced it.
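What a given build actually supports is easy to check: FFmpeg can list its hardware acceleration methods and the encoders compiled in:

```bash
# Which hwaccel methods does this build know about?
ffmpeg -hide_banner -hwaccels
# Which NVENC encoders (and software x264/x265) are compiled in?
ffmpeg -hide_banner -encoders | grep -Ei 'nvenc|libx26'
```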

Let's start over

We continue our Sisyphean labor and will now rerun the tests on the laptop and the test server with a proper codec (h265).
While reading the documentation, I also realized that my rescaling had been running on the CPU all along, even though it could be done with CUDA via the scale_cuda filter.

```bash
ffmpeg-cuda -hwaccel cuda -hwaccel_output_format cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale_cuda=1920:1080" -c:v hevc_nvenc -preset medium output_cuda.mp4
```
1. RTX 3050 Ti:

| Test number | Speed |
| --- | --- |
| 1 | 6.59x |
| 2 | 6.59x |
| 3 | 6.57x |
| 4 | 6.56x |
| 5 | 6.56x |

– Average: 6.576x
– Difference, h265 vs mpeg4: +37.5%
– Difference with Ryzen 5600H: +297%
– Difference with Quadro P400: +245.7%

2. Quadro P400:

| Test number | Speed |
| --- | --- |
| 1 | 2.68x |
| 2 | 2.68x |
| 3 | 2.67x |
| 4 | 2.67x |
| 5 | 2.67x |

– Average: 2.676x
– Difference, h265 vs mpeg4: –11.7%
– Difference with Ryzen 5600H: +92.5%
– Difference with RTX 3050 Ti: –245.7%

```bash
ffmpeg -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080" -c:v libx265 -preset medium output_libx265.mp4
```
3. Ryzen 5600H:

| Test number | Speed |
| --- | --- |
| 1 | 1.4x |
| 2 | 1.37x |
| 3 | 1.42x |
| 4 | 1.35x |
| 5 | 1.41x |

– Average: 1.39x
– Difference, h265 vs mpeg4: –297%
– Difference with Quadro P400: –92.5%
– Difference with RTX 3050 Ti: –473%

4. Xeon 2630:

  • Unfortunately, the server's FFmpeg build has no libx265/libx264 support, so the Xeon sits this one out.

The results already look closer to the truth, though not without anomalies – the Quadro P400 suddenly performed 11.7% worse in h265.

I wonder how something more powerful from the server segment stacks up against laptop hardware, rather than aged gear from the "used PC for study" category beloved of budget builders? Let's find out!

The Experiments column has gone too far. Wake up, Mr. Freeman.

Test server 2 — AlmaLinux 9.4, kernel 6.6.31-1.el9.x86_64, 7 GB DDR4 and 2 GB swap.
CPU — EPYC 7551P, 32 cores with SMT disabled, GPU — RTX A2000.

Nvidia driver version is 555.42.02, the card has 6 GB of VRAM, and FFmpeg is a custom build with CUDA optimizations.

In FP32, the mobile RTX 3050 Ti and the RTX A2000 should post roughly the same numbers. The main difference is video memory: the A2000 has 6 GB versus 4.

```bash
ffmpeg-cuda -hwaccel cuda -hwaccel_output_format cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale_cuda=1920:1080" -c:v hevc_nvenc -preset medium output_cuda.mp4
```

RTX A2000:

| Test number | Speed |
| --- | --- |
| 1 | 6.24x |
| 2 | 6.81x |
| 3 | 6.8x |
| 4 | 6.79x |
| 5 | 6.76x |

– Average: 6.68x
– Difference with RTX 3050 Ti: +1.58%
– Difference with Ryzen 5600H: +6.43%
– Difference with Xeon 2630: +89%
– Difference with Quadro P400: +249%
– Difference with EPYC 7551P: +237%

As a result, we get essentially the same numbers, within the margin of error. This time the teraflops matched reality. But there is a snag with the EPYC: the server has no AMD AMF, and without it the Epycs are incapable of hardware acceleration, nor does this build offer libx264/libx265 as a fallback.

Well then, let's try old man mpeg4 again.

```bash
ffmpeg -i bbb_sunflower_2160p_30fps_normal.mp4 -vf scale=1920:1080 -c:v mpeg4 -preset medium output_no_acc.mp4
```

EPYC 7551P:

| Test number | Speed |
| --- | --- |
| 1 | 2.89x |
| 2 | 2.79x |
| 3 | 2.83x |
| 4 | 2.77x |
| 5 | 2.81x |

– Average: 2.818x
– Difference with RTX 3050 Ti: –301%
– Difference with Ryzen 5600H: –89.8%
– Difference with Xeon 2630: –62%
– Difference with Quadro P400: +5.3%
– Difference with RTX A2000: +57%

```bash
ffmpeg-cuda -hwaccel cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080,format=yuv420p" -c:v mpeg4 -preset medium output_no_cuda_scale.mp4
```

And on the GPU –

RTX A2000:

| Test number | Speed |
| --- | --- |
| 1 | 1.76x |
| 2 | 1.81x |
| 3 | 1.8x |
| 4 | 1.79x |
| 5 | 1.77x |

– Average: 1.786x
– Difference with RTX 3050 Ti: –267%
– Difference with Ryzen 5600H: –231%
– Difference with Xeon 2630: –97.8%
– Difference with Quadro P400: –67.4%
– Difference with EPYC 7551P: –57%

AMD certainly knows how to surprise, but it often prefers to disappoint. I expected a different result from 32 cores; apparently the Epycs are not tuned for multimedia work at all. And I dare say the suspiciously good result of the Ryzen 5600H was down to its iGPU.
Nvidia was not without its mysticism and oddities either – the A2000 handled mpeg4 much worse than the RTX 3050 Ti. Their raw performance is nearly identical, so I assume the gap comes down to different software versions on my laptop and the test server.
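Whether the software versions really are to blame is easy to check by comparing the exact builds and encoder options on both machines:

```bash
# Print build info, then the hevc_nvenc encoder's options for this build.
ffmpeg -version
ffmpeg -hide_banner -h encoder=hevc_nvenc
```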

Now in h264

For the sake of completeness, let's also see whether anything changes if we keep the codec the same as the source video – h264:

```bash
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale_cuda=1920:1080" -c:v h264_nvenc -preset medium output_cuda.mp4
```
1. RTX A2000:

| Test number | Speed |
| --- | --- |
| 1 | 6.24x |
| 2 | 6.79x |
| 3 | 6.81x |
| 4 | 6.8x |
| 5 | 6.8x |

– Average: 6.68x
– Difference with RTX A2000 (H265): 0%

2. Quadro P400:

| Test number | Speed |
| --- | --- |
| 1 | 2.66x |
| 2 | 2.66x |
| 3 | 2.66x |
| 4 | 2.66x |
| 5 | 2.66x |

– Average: 2.66x
– Difference with Quadro P400 (H265): –12.4%

3. RTX 3050 Ti:

| Test number | Speed |
| --- | --- |
| 1 | 6.66x |
| 2 | 6.66x |
| 3 | 6.69x |
| 4 | 6.65x |
| 5 | 6.66x |

– Average: 6.66x
– Difference with RTX 3050 Ti (H265): +1.27%

```bash
ffmpeg -i bbb_sunflower_2160p_30fps_normal.mp4 -vf "scale=1920:1080" -c:v libx264 -preset medium output_cpu.mp4
```
4. Ryzen 5600H:

| Test number | Speed |
| --- | --- |
| 1 | 2.49x |
| 2 | 2.48x |
| 3 | 2.47x |
| 4 | 2.46x |
| 5 | 2.48x |

– Average: 2.476x
– Difference with Ryzen 5600H (H265): +78.12%

The results are, on the one hand, logical and, on the other, surprising once again. The RTX A2000 showed no change at all between H264 and H265, the RTX 3050 Ti stayed within the margin of error, the Quadro P400 suddenly did noticeably worse with H264, and the Ryzen 5600H, on the contrary, handled H264 much better.

Final table

Before

| Test configuration | Original codec | Final codec | Average |
| --- | --- | --- | --- |
| Ryzen 5600H | h264 | mpeg4 | 4.138x |
| RTX 3050 Ti | h264 | mpeg4 | 4.78x |
| Xeon 2630 | h264 | mpeg4 | 3.534x |
| Quadro P400 | h264 | mpeg4 | 2.99x |
| EPYC 7551P | h264 | mpeg4 | 2.818x |
| RTX A2000 | h264 | mpeg4 | 1.786x |

After (H265)

| Test configuration | Original codec | Final codec | Average |
| --- | --- | --- | --- |
| RTX A2000 | h264 | h265 | 6.68x |
| RTX 3050 Ti | h264 | h265 | 6.576x |
| Quadro P400 | h264 | h265 | 2.676x |
| Ryzen 5600H | h264 | h265 | 1.39x |

After (H264)

| Test configuration | Original codec | Final codec | Average |
| --- | --- | --- | --- |
| RTX A2000 | h264 | h264 | 6.68x |
| RTX 3050 Ti | h264 | h264 | 6.66x |
| Quadro P400 | h264 | h264 | 2.66x |
| Ryzen 5600H | h264 | h264 | 2.476x |

Conclusions

"I know that I know nothing." (c) Socrates, probably – though even that is not certain.

I'll be back in the next installment, where we'll probably poke at FFmpeg and transcoding again – or pick a new, less painful topic.
Thanks for reading – I look forward to your comments!
