Running CUDA Code on AMD Graphics Cards

CUDA is the most widely used platform for accelerating massively parallel computation in both applied and research work.

In 2016, AMD introduced ROCm, a platform that closely mirrors CUDA. The table below, taken from AMD's official website, lists the ROCm counterpart of each CUDA module.

Table of correspondence of platform modules

| CUDA platform module | ROCm platform module |
| --- | --- |
| cuBLAS | rocBLAS |
| cuFFT | rocFFT |
| cuSPARSE | rocSPARSE |
| cuSOLVER | rocSOLVER |
| AMG-X | rocALUTION |
| Thrust | rocThrust |
| CUB | rocPRIM |
| cuDNN | MIOpen |
| cuRAND | rocRAND |
| EIGEN | EIGEN |
| NCCL | RCCL |

The platform's HIP layer and its HIPIFY utility let you automatically translate source code written for CUDA into ROCm code and compile it. One drawback of the platform is that it targets Linux only.

Let’s proceed directly to porting code and comparing the performance of platforms.

Test configuration

| | PC 1 | PC 2 |
| --- | --- | --- |
| Operating system | Windows 10 Pro 21H1 | Ubuntu 22.04 (kernel 5.15.0-53-generic) |
| CPU | 2× Intel Xeon Gold 6132 | i5-12600K |
| RAM | 4× 16 GB DDR4 | 1× 32 GB DDR4 |
| GPU | GeForce RTX 3070 8 GB | Radeon RX 6800 XT 16 GB |

1. Installing CUDA on Windows OS

Go to the NVIDIA website (https://developer.nvidia.com/cuda-downloads) and download the latest version of the CUDA Toolkit for your platform. The screenshot below shows the minimum set of components needed to compile and run CUDA code on Windows.

Minimum Required Installation Configuration

2. Installing ROCm on Linux OS

Let's walk through installing ROCm on Ubuntu 22.04. Installation methods for some other Linux distributions are listed at https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.3/page/How_to_Install_ROCm.html.

2.1 Download the installer package and install it.

sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/5.3/ubuntu/jammy/amdgpu-install_5.3.50300-1_all.deb
sudo apt-get install ./amdgpu-install_5.3.50300-1_all.deb

2.2 Installing the required ROCm components

sudo amdgpu-install --usecase=dkms,rocm,rocmdevtools,lrt,hip,hiplibsdk,mllib,mlsdk

Errors may appear during installation, but in my experience they did not affect how the platform works. I am not 100% sure that this is the minimum set of components required; I arrived at it by trial and error.
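Since the installer can print errors, it may help to confirm that the tools used later in this article are actually present. This is a minimal sketch, not an official check: check_tool and ROCM_PREFIX are names introduced here for illustration, and the default prefix /opt/rocm-5.3.0 is assumed from the paths used in section 4.

```shell
# Assumed install prefix; override with ROCM_PREFIX=... if yours differs.
ROCM_PREFIX="${ROCM_PREFIX:-/opt/rocm-5.3.0}"

# Report whether a tool is reachable on PATH or under the ROCm prefix.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found on PATH"
  elif [ -x "$ROCM_PREFIX/bin/$1" ]; then
    echo "$1: found in $ROCM_PREFIX/bin"
  else
    echo "$1: missing"
  fi
}

# Tools this article relies on.
for tool in rocminfo hipcc hipify-clang; do
  check_tool "$tool"
done
```

If a tool is reported as missing, rerunning the amdgpu-install step with the use cases listed above is the first thing to try.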

2.3 Installing CUDA.

To port CUDA code to ROCm, you also need to install the CUDA Toolkit, because hipify-clang needs the CUDA headers to parse the source. The easiest way to do this is with the commands below. (Other CUDA versions and installation methods are listed at https://developer.nvidia.com/cuda-downloads.)

wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run

CUDA installation configuration

3. Compiling source code on Windows OS

As a test case, let's take code from GitHub that multiplies matrices of random 32-bit integers (https://github.com/lzhengchun/matrix-cuda).

The PowerShell commands below download and compile the source files. When they finish, the executable file “a.exe” appears in the source code directory.

git clone https://github.com/lzhengchun/matrix-cuda
cd matrix-cuda
nvcc ./matrix_cuda.cu

4. Converting CUDA code to ROCm code and compiling it on Ubuntu OS

CUDA code is converted to ROCm code with HIPIFY, a utility from the ROCm platform. The name comes from HIP (Heterogeneous-compute Interface for Portability), the programming interface of the ROCm platform.

git clone https://github.com/lzhengchun/matrix-cuda
cd matrix-cuda
/opt/rocm-5.3.0/bin/hipify-clang matrix_cuda.cu

After these commands run, the file matrix_cuda.cu.hip appears next to matrix_cuda.cu. This new file is the source code for the ROCm platform.
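To get a feel for what the generated file looks like, the sketch below imitates the kind of renaming the conversion produces. It is a toy illustration only: hipify-clang parses the code with Clang and does not work by text substitution, toy_hipify is a name invented here, and the sample lines (d_a, h_a, bytes) are hypothetical rather than taken from matrix_cuda.cu.

```shell
# Toy stand-in for the CUDA -> HIP renaming; NOT how hipify-clang works.
toy_hipify() {
  sed -e 's|cuda_runtime\.h|hip/hip_runtime.h|' \
      -e 's/cuda\([A-Z]\)/hip\1/g'
}

# Hypothetical CUDA lines, fed through the toy converter.
printf '%s\n' \
  '#include <cuda_runtime.h>' \
  'cudaMalloc((void **)&d_a, bytes);' \
  'cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);' \
  'cudaFree(d_a);' | toy_hipify
```

The runtime calls keep their argument lists and only change prefix (cudaMalloc becomes hipMalloc, cudaMemcpyHostToDevice becomes hipMemcpyHostToDevice), which is why simple programs like this one can be converted without manual edits.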

Code for the ROCm platform is compiled with the hipcc compiler. After the command below runs, the executable file “a.out” appears in the source code directory.

/opt/rocm-5.3.0/bin/hipcc matrix_cuda.cu.hip

5. Platform Performance Comparison

| Matrix size | CUDA runtime | ROCm runtime |
| --- | --- | --- |
| 1000×1000 | 2.536 ms | 5.812 ms |
| 10000×10000 | 195.123 ms | 297.219 ms |

In this example, ROCm is about 1.5 times slower on the 10000×10000 matrices and about 2.3 times slower on the 1000×1000 matrices. I attribute this lag to the AMD architecture, which has fewer blocks for operations on 32-bit numbers.

Let’s convert the source files to perform operations on 16-bit numbers and test the performance of the platforms again.

| Matrix size | CUDA runtime | ROCm runtime |
| --- | --- | --- |
| 1000×1000 | 0.83256 ms | 1.421 ms |
| 10000×10000 | 153.241699 ms | 16.105 ms |
| 20000×20000 | 256.836761 ms | 52.155 ms |

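For operations on 16-bit numbers, ROCm is the faster platform on the large matrices: about 9.5 times faster at 10000×10000 and about 4.9 times faster at 20000×20000. On the small 1000×1000 matrices, CUDA is still about 1.7 times faster.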
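The ratios behind both comparisons can be recomputed from the two tables in this section. This is a small sketch using awk: the helper name ratio is introduced here, and all timings are taken to be in milliseconds. A ratio above 1 means ROCm was slower than CUDA, and a ratio below 1 means it was faster.

```shell
# Print ROCm time divided by CUDA time, rounded to two decimals.
# $1 = CUDA runtime in ms, $2 = ROCm runtime in ms
ratio() {
  awk -v cuda="$1" -v rocm="$2" 'BEGIN { printf "%.2f\n", rocm / cuda }'
}

echo "32-bit  1000x1000:   $(ratio 2.536 5.812)"
echo "32-bit  10000x10000: $(ratio 195.123 297.219)"
echo "16-bit  1000x1000:   $(ratio 0.83256 1.421)"
echo "16-bit  10000x10000: $(ratio 153.241699 16.105)"
echo "16-bit  20000x20000: $(ratio 256.836761 52.155)"
```

Note that the two platforms ran on different machines (PC 1 and PC 2), so these ratios compare whole configurations, not just the GPUs.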

6. Conclusion

So owners of latest-generation AMD Radeon video accelerators can, in a couple of steps, convert CUDA code into code that also runs fast on “red” video cards.

PS

This is my first article on Habr. I decided to write it because I spent a very long time getting all of this set up myself. Maybe it will save someone else some time.