We study the performance of OpenCL on CPUs and compatible GPUs

Hello! My name is Mikhail Kozlov, and I am an intern engineer in the mathematical library development group at YADRO. This area is developing actively on RISC-V: well-known mathematical libraries such as OpenBLAS, Eigen, and many others are being ported to and optimized for the open architecture. Of particular interest is OpenCL, an open standard for developing software for heterogeneous computing. It is used in many areas, including HPC, AI/ML, AR/VR, and linear algebra, where it is most prominently represented by the clBLAS and CLBlast libraries.

clBLAS is the older library; CLBlast is more modern and ships with a built-in tuner for optimizing kernels for specific hardware. Below I describe how my team and I examined the performance of these libraries on Imagination and ARM Mali GPUs. In addition, I will show how to run these libraries on a RISC-V CPU using OpenCL – more precisely, a modified version of PoCL created by the developers of the Vortex GPU.

Running CLBlast on GPU

CLBlast is an open-source C++ library licensed under the Apache License 2.0 and built on OpenCL kernels. CLBlast implements the standard BLAS functions plus 9 additional BLAS-like functions. Each implementation supports five floating-point types: three real types (FP16, FP32, FP64) and two complex types (2xFP32, 2xFP64).

BLAS functions implemented in CLBlast

The library has several APIs.

  • CLBlast C API — written in C, does not support template functions, created for compatibility with the BLAS API.

  • CLBlast API — written in C++, supports template functions; otherwise its functionality is similar to the C API.

  • Netlib BLAS API — written in C, supports all computational operations, needed to support applications built on this standard API.

  • CLCudaAPI — written in C++, implements a wrapper over the OpenCL API and/or CUDA API. Provides high portability between CUDA and non-CUDA devices with low overhead.
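
To make the C++ API more concrete, here is a minimal sketch of an SGEMM call through CLBlast. It is not taken from the article's benchmarks: it simply picks the first available OpenCL device, uploads arbitrary matrices, and calls the templated clblast::Gemm. Error handling is omitted for brevity.

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <clblast.h>
#include <cstdio>
#include <vector>

int main() {
  // Take the first available platform and device.
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);
  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);

  // C = alpha*A*B + beta*C with square m x m matrices in row-major layout.
  const size_t m = 256;
  const size_t bytes = m * m * sizeof(float);
  std::vector<float> host_a(m * m, 1.0f), host_b(m * m, 2.0f), host_c(m * m, 0.0f);
  cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);
  cl_mem b = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);
  cl_mem c = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);
  clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, bytes, host_a.data(), 0, nullptr, nullptr);
  clEnqueueWriteBuffer(queue, b, CL_TRUE, 0, bytes, host_b.data(), 0, nullptr, nullptr);
  clEnqueueWriteBuffer(queue, c, CL_TRUE, 0, bytes, host_c.data(), 0, nullptr, nullptr);

  // The data type is a template argument: float selects the SGEMM implementation.
  cl_event event = nullptr;
  const auto status = clblast::Gemm<float>(clblast::Layout::kRowMajor,
                                           clblast::Transpose::kNo, clblast::Transpose::kNo,
                                           m, m, m,
                                           1.0f, a, 0, m, b, 0, m,
                                           0.0f, c, 0, m,
                                           &queue, &event);
  if (status == clblast::StatusCode::kSuccess) {
    clWaitForEvents(1, &event);
    clReleaseEvent(event);
  }
  clEnqueueReadBuffer(queue, c, CL_TRUE, 0, bytes, host_c.data(), 0, nullptr, nullptr);
  std::printf("C[0] = %.1f\n", host_c[0]);  // 256 * (1.0 * 2.0) = 512.0

  clReleaseMemObject(a); clReleaseMemObject(b); clReleaseMemObject(c);
  clReleaseCommandQueue(queue); clReleaseContext(context);
  return 0;
}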

CLBlast includes 18 kernels, which is significantly fewer than the number of supported functions. There are two reasons for this. First, each kernel can work with any of the listed data types, which are selected at compile time. Second, many kernels are reused across different functions: for example, the implementation of the GBMV function uses the OpenCL GEMV kernel with modified preprocessing.

Kernels used in several functions at once, and the lists of these functions

CLBlast kernels expose many parameters, which makes it possible to optimize them with CLTune. The parameters are divided into two sets. The first contains the most common combinations, whose impact on kernel performance across various devices is well studied. The second contains all possible combinations: here the number of sets is so large that users run their own experiments with different sets and propose new optimal parameters.

The user can also use CLTune to independently optimize kernels for problems of particular dimensions. CLTune runs the kernels with different sets of parameters and selects the best set across all runs. Here are a few of these parameters as examples:

  • WGS (Work Group Size). This parameter appears in many kernels; it determines the number of threads, the size of the thread group, and the required amount of local memory. It can be used in different ways: in one-dimensional problems a thread group of size WGS x 1 is formed, and in two-dimensional problems one of size WGS x WGS. When working with large dimensions, data is transferred to the device and processed in portions; in such cases WGS determines the size of the local array used in the calculations.

  • WPT (Work Per Thread). Found in the AXPY and GEMV kernels. This parameter determines the number of consecutive elements in local memory that are processed by one thread. With a well-chosen value, the number of memory requests can be reduced several times.

  • VW (Vector Width). Found explicitly in the AXPY and GEMV kernels and used in their vectorized implementations. The parameter specifies the number of elements packed into a single register.

Parameters in CLBlast kernels
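
To illustrate how such parameters reach the kernels, here is a simplified sketch (not the actual CLBlast source): an AXPY-style OpenCL kernel kept as a string in the host code, with WGS and WPT injected as compile-time definitions when the kernel is built. VW would additionally switch the loads to vector types such as float4, which is omitted here.

#include <string>

// Simplified AXPY-style kernel: WGS threads per work-group, WPT elements per thread.
static const std::string kAxpySource = R"CLC(
__kernel __attribute__((reqd_work_group_size(WGS, 1, 1)))
void SimpleAxpy(const int n, const float alpha,
                __global const float* restrict x, __global float* y) {
  const int base = get_global_id(0) * WPT;   // each thread handles WPT consecutive elements
  for (int w = 0; w < WPT; ++w) {
    const int i = base + w;
    if (i < n) { y[i] += alpha * x[i]; }
  }
}
)CLC";

// The tuner-chosen values are passed to the OpenCL compiler at kernel build time, e.g.:
static const char* kBuildOptions = "-DWGS=64 -DWPT=4";
// clBuildProgram(program, 1, &device, kBuildOptions, nullptr, nullptr);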

Below is the architecture of the library:

Testing the CLBlast library

Before using a new library, you need to make sure that it is functionally correct, especially on a new architecture. To do this, we ran the functional tests on the platforms under study.

The functional testing system in CLBlast is structured as follows. First, we run the implementation of the function under test from a reference library: either clBLAS or some CPU BLAS library. I chose OpenBLAS – the most popular library of its kind, well optimized for specific hardware. We are confident in its correctness, having personally fixed many issues in it.

Then we run the same function from CLBlast and compare the results or, when comparing against clBLAS, the returned status codes. The runs produced a fairly long list of crashes and incorrect answers. All failed tests can be divided into five groups.

Return codes do not match. The testing system expects the same error codes from CLBlast and clBLAS. But the functions in the two libraries are implemented differently and require different amounts of resources. For example, CLBlast has an improved version of GEMM that uses an additional buffer; clBLAS has no such implementation. So the test for an insufficient additional buffer size fails: we receive an error code from CLBlast and a success code from clBLAS.

Working with half-precision numbers. The CLBlast testing system handles all data types uniformly. The only exceptions are some functions over half-precision numbers. C++ has no native fp16 type, so such numbers are represented in code as bit patterns stored in an unsigned short – the standard way of representing them in OpenCL. For these numbers, methods were implemented to convert them to fp32 and back.

At first, difficulties arose with the function that returns the index of the maximum/minimum element: it is special in that it returns an integer rather than a floating-point value. The remaining functions interpret the bit sequence returned by the kernel as a floating-point number (or an array of them), so they call the fp16-to-fp32 conversion. Because the test system is generic, the bit sequence returned by iAMAX is also treated as fp16, even though it is actually an integer.

tester.cpp
bool TestSimilarity(const half val1, const half val2) {
  const auto kErrorMarginRelative = getRelativeErrorMargin<half>();
  const auto kErrorMarginAbsolute = getAbsoluteErrorMargin<half>();
  return TestSimilarityNear(HalfToFloat(val1), HalfToFloat(val2),
                            kErrorMarginAbsolute, kErrorMarginRelative);
}
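
For context, a bit-level fp16-to-fp32 conversion like the HalfToFloat used above might look roughly as follows. This is a simplified sketch, not the actual CLBlast implementation: it handles normal values, zeros, infinities, and NaN, and flushes subnormal halves to zero for brevity.

#include <cstdint>
#include <cstring>

using half = unsigned short;  // fp16 stored as a raw bit pattern, as in OpenCL

float HalfToFloat(const half h) {
  const uint32_t sign     = static_cast<uint32_t>(h & 0x8000u) << 16;  // sign to bit 31
  const uint32_t exponent = (h & 0x7C00u) >> 10;                       // 5-bit exponent
  const uint32_t mantissa = (h & 0x03FFu);                             // 10-bit mantissa

  uint32_t bits;
  if (exponent == 0) {           // zero (subnormals are flushed to zero here)
    bits = sign;
  } else if (exponent == 0x1F) { // infinity or NaN: float exponent all ones
    bits = sign | 0x7F800000u | (mantissa << 13);
  } else {                       // normal number: rebias exponent from 15 to 127
    bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
  }

  float result;
  std::memcpy(&result, &bits, sizeof(result));  // reinterpret the bits as a float
  return result;
}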

A problem of a slightly different nature appears when testing the HGEMMBATCHED and HAXPYBATCHED functions. A BATCHED version of an algorithm combines n calls of one kernel into a single call: memory for the input and output data of all calls is allocated at once, and the computations run together. The matrix multiplication and vector addition functions also take scalar parameters.

In the BATCHED versions, the test is set up so that each of the grouped calls receives different values for these parameters: on each subsequent call the value is increased by 1. For half-precision numbers this addition follows integer rules rather than floating-point rules, because the values are stored as raw bit patterns. The result of the addition is then interpreted as fp16, which turns the scalar parameters into NaN values.

xgemmbatched.hpp
args.alphas[batch] = args.alpha + Constant<T>(static_cast<double>(batch + 1)); 
args.betas[batch] = args.beta + Constant<T>(static_cast<double>(batch + 1));
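
A toy illustration (the values here are hypothetical and unrelated to the actual test data) of why integer arithmetic on the stored bit pattern does not match fp16 arithmetic:

#include <cstdio>

int main() {
  // fp16 values travel through the test system as unsigned short bit patterns,
  // so "alpha + 1" increments the bit pattern rather than the represented value.
  unsigned short alpha = 0x3C00;     // bit pattern of fp16 1.0
  unsigned short wrong = alpha + 1;  // 0x3C01 decodes to roughly 1.001, not 2.0
  std::printf("bits: 0x%04X -> 0x%04X\n", alpha, wrong);
  return 0;
}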

Incorrect reference results. The TRSM function in clBLAS returns a different result than the same function in CLBlast and OpenBLAS.

Problems in OpenCL kernels. When working with half-precision numbers, the GEMMBATCHED and AXPYBATCHED kernels attempted to read an array of 16-bit numbers as an array of 32-bit numbers. This is caused by incorrect use of macros.

xgemmbatched.opencl
void XaxpyBatched(const int n, const __constant real_arg* arg_alphas,
                  const __global real* restrict xgm, const __constant int* x_offsets, const int x_inc,
                  __global real* ygm, const __constant int* y_offsets, const int y_inc) {
  const int batch = get_group_id(1);
  const real alpha = GetRealArg(arg_alphas[batch]);

common.opencl
#if PRECISION == 16
  typedef float real_arg;
  #define GetRealArg(x) (half)x
#else
  typedef real real_arg;
  #define GetRealArg(x) x
#endif

Problems with individual devices. All the problems described above were reproduced on every tested configuration. But there were also cases where problems appeared only on specific boards. For example, the ASUM function did not work correctly on the ARM Mali GPU.

We fixed the issues in the test system and in the kernels that did not handle half-precision numbers correctly. clBLAS is no longer particularly relevant, so we may get around to it later, and third-party hardware is beyond our control, so there we can only study the causes of the problem. We therefore decided to exclude a number of these problematic functions from the performance tests.

The problem with mismatched return codes is fundamental to the test system. We came to the conclusion that fixing it would require a deep rework of the entire system. After making sure that the returned CLBlast statuses were correct and as expected, we decided to postpone this issue.

Experimental results

We selected three popular functions from different levels of the BLAS API as tests (a reference sketch of the three operations follows the list):

  • The first level is AXPY, a vector addition operation and a memory-bound task.

  • The second level is GEMV, matrix-vector multiplication.

  • The third level is GEMM, matrix multiplication, a compute-bound task.
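
For reference, here are naive single-threaded versions of these three operations (square matrices and row-major layout are assumed for simplicity). They only illustrate what each routine computes, not how any of the benchmarked libraries implements it.

#include <cstddef>
#include <vector>

// Level 1: y = alpha * x + y (one multiply-add per two loads and a store, memory-bound)
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
  for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
}

// Level 2: y = alpha * A * x + beta * y, with A an n x n row-major matrix
void gemv(std::size_t n, float alpha, const std::vector<float>& A,
          const std::vector<float>& x, float beta, std::vector<float>& y) {
  for (std::size_t i = 0; i < n; ++i) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < n; ++j) acc += A[i * n + j] * x[j];
    y[i] = alpha * acc + beta * y[i];
  }
}

// Level 3: C = alpha * A * B + beta * C (O(n^3) work on O(n^2) data, compute-bound)
void gemm(std::size_t n, float alpha, const std::vector<float>& A,
          const std::vector<float>& B, float beta, std::vector<float>& C) {
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
  }
}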

All calculations were performed on single-precision numbers. Using the built-in tuner of CLBlast did not give a performance increase in our case – most often the default parameters remained the best ones. On the GPU of the VIM4 board, functions even became slower with the new parameters. This may be because tuning is performed only for one specific problem size. The test configurations were as follows:

OpenBLAS for LicheePI was compiled with support for the V0p7 vector extension and run in multi-threaded mode – 4 threads on RISC-V, 8 threads on ARM. GPU performance figures were obtained with the clpeak utility, which implements synthetic OpenCL benchmarks.

AXPY

CLBlast outperforms clBLAS by 25–50% on the ARM boards, lags behind by almost 2.5 times on BXM-4-64, and performs on par on BXE-2-32. On the Mali G610, OpenBLAS is 5% faster than clBLAS and 25% slower than CLBlast. On the other boards, OpenBLAS is 5–17% ahead of its competitors.

GEMV

The performance of clBLAS varies considerably with the problem size. In comparison, CLBlast is on average 70% slower, but only on BXM-4-64 and at larger dimensions. On the other accelerators, CLBlast is, on the contrary, 75–100% faster than clBLAS.

As for OpenBLAS, on the Imagination GPUs it is on average 55–70% slower than clBLAS. On the Mali G610 it lags behind CLBlast and clBLAS by 80% and 8%, respectively. But on the Mali G52, OpenBLAS is 2–3.5 times faster than its competitors.

GEMM

At higher dimensions, CLBlast outperforms clBLAS by 45–65%; on BXM-4-64 it runs 2.5 times faster. On BXM-4-64, OpenBLAS is ahead of CLBlast by 10%, and on the Mali G52 by 56%. On BXE-2-32 and the Mali G610, the performance of OpenBLAS is lower than that of CLBlast by 30% and 44%, respectively.

From the results of runs on the GPU, we can conclude that in most cases CLBlast significantly outperforms clBLAS, with the only exception being runs of the memory-bound benchmarks SAXPY and SGEMV on LicheePi. Also, CLBlast on Mali G610 and BXE-2-32 performs better than OpenBLAS in all tasks.

Comparison of performance of different GPUs

The results correlate well with the characteristics of the GPUs under consideration: the more powerful ARM solutions show better results. The Mali G610 has the highest performance: on AXPY and GEMM the accelerator is about 2.2 times faster than the Mali G52, its closest counterpart in raw power, and on GEMV it is on average 4 times faster.

The G610 also demonstrates interesting dynamics in matrix-vector multiplication: at higher dimensions there is a decrease in performance. This may be due to the fact that as the number of elements increases, the share of operations for moving data from the CPU to the GPU in the total time also increases.

GPUs from Imagination behave very curiously. On AXPY and GEMV, the BXM-4-64 accelerator is slower than the BXE-2-32, although the former GPU performs better in the clpeak benchmarks. The reason for this may be the slow transfer of data from the CPU to the GPU on the LicheePi board. On GEMM, the performance of these GPUs compares in the same way as in the clpeak benchmarks: the BXM-4-64 is approximately twice as fast.

Running OpenCL on a RISC-V CPU

To run OpenCL on RISC-V, we will use PoCL, a portable, open-source implementation of OpenCL. It runs on a wide range of devices, including general purpose CPUs and GPUs, as well as other custom accelerators.

The PoCL architecture includes a runtime library that implements the OpenCL API and an LLVM-based compiler for compiling kernels. The following modifications were made specifically to support RISC-V in PoCL:


  • new supported devices were added to the build configuration,

  • cross-compilation of the runtime environment into a RISC-V library was enabled,

  • a new execution mode for offline kernel compilation was added.

The operation of PoCL on RISC-V is described in more detail in an article by the Vortex GPU developers.

Building PoCL directly on the board did not pose any particular difficulties, but cross-compilation turned out to be a real challenge. Here is the procedure I eventually arrived at:

  • Install Clang and LLVM both for the local machine and for the board (for the board you can simply copy /usr/lib/llvm-x from it, where x is the LLVM version on the board). The versions of the two builds must match.

  • If necessary, install ocl-icd and libhwloc on the board and copy their files from the board into the lib folder of the LLVM installation built for the board.

  • Install pkg-config on the machine where the build takes place and set the environment variable:

export PKG_CONFIG_PATH=/path/to/hwloc/prefix/lib/pkgconfig:/path/to/opencl/prefix/lib/pkgconfig
  • Copy the folder with the main libraries from the board (/usr/lib/riscv64-linux-gnu) into the lib folder of the LLVM installation built for the board.

  • Copy /usr/lib/gcc/riscv64-linux-gnu/13/libstdc++.so into the lib folder of the LLVM installation built for the board.

  • Download a toolchain in order to use its header files: copy the contents of riscv-gcc/sysroot/usr/include into the include folder of the LLVM installation built for the board.

  • Add compilation flags and specify the folder paths in the file //ToolchainExample.cmake:

SET(CMAKE_C_COMPILER <devtoolkit dir>/riscv-gcc/bin/riscv64-unknown-linux-gnu-gcc)
SET(CMAKE_CXX_COMPILER <devtoolkit dir>/sc-devtoolkit/riscv-gcc/bin/riscv64-unknown-linux-gnu-g++)
SET(CMAKE_CXX_FLAGS "-mabi=lp64d -march=rv64imafdczbb0p93_zba0p93 -mcpu=sifive-u74 -mtune=sifive-7-series")
SET(CMAKE_C_FLAGS "-mabi=lp64d -march=rv64imafdczbb0p93_zba0p93 -mcpu=sifive-u74 -mtune=sifive-7-series")
SET(CMAKE_C_FLAGS "-I<hwloc install path>/include")
SET(CMAKE_CXX_FLAGS "-Wl,-rpath-link,<llvm-x install path>/lib")
# should work, but does not yet. Instead set FIND_ROOT below
# set(CMAKE_SYSROOT /home/a/zynq/ZYNQ_ROOT)
# where is the target environment
SET(CMAKE_FIND_ROOT_PATH <llvm-x install path>/)
# where to find libraries in target environment
SET(CMAKE_LIBRARY_PATH <llvm-x install path>/lib)
  • Copy the llvm-config executable from the bin folder of the LLVM installation built for the local machine into the bin folder of the LLVM installation built for the board, replacing the file there.

  • As in a native build, modify the file /lib/CL/pocl_llvm_wg.cc.

  • If type-redefinition problems appear while building PoCL, you need to modify the source files of the LLVM installation built for the board as well as the PoCL source files.

$LLVM_BUILD_PREFIX/lib/clang/x/include/stdint.h
96: - typedef __INT64_TYPE__ int64_t;
    + typedef long long int int64_t;
98: - typedef __UINT64_TYPE__ uint64_t;
    + typedef unsigned long long int uint64_t;
/lib/kernel/printf.c
32: - typedef intptr_t ssize_t; 
 + typedef int ssize_t;
cmake -DHOST_DEVICE_BUILD_HASH=riscv64-unknown-linux-gnu-rv64gc \
      -DLLC_HOST_CPU=generic-rv64 \
      -DHOST_CPU_TARGET_ABI=lp64d \
      -DENABLE_LLVM=1 \
      -DCMAKE_TOOLCHAIN_FILE=<pocl_source_dir>/ToolchainExample.cmake \
      -DCMAKE_PREFIX_PATH=$HOST_PREFIX \
      -DLLC_TRIPLE=riscv64-unknown-gnu \
      -DLLVM_HOST_TARGET=riscv64-unknown-linux-gnu \
      -DENABLE_ICD=ON(OFF) \
      -DLLVM_BINDIR=$BUILD_PREFIX/bin \
      <pocl_source_dir>
cmake --build . --target install
Here $BUILD_PREFIX and $HOST_PREFIX are the installation directories of the corresponding LLVM builds described in the steps above.

If PoCL was built with the -DENABLE_ICD=ON option, you need to correct the path to the library written in the file /etc/OpenCL/vendors/pocl.icd.
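
A quick way to check that the resulting PoCL build is picked up correctly (a small sketch of my own, not part of the PoCL tooling) is to list every platform and device visible through the OpenCL ICD loader:

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_uint num_platforms = 0;
  clGetPlatformIDs(0, nullptr, &num_platforms);
  std::vector<cl_platform_id> platforms(num_platforms);
  clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

  for (cl_platform_id platform : platforms) {
    char name[256] = {0};
    clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, nullptr);
    std::printf("Platform: %s\n", name);   // PoCL reports itself as "Portable Computing Language"

    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
    std::vector<cl_device_id> devices(num_devices);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);
    for (cl_device_id device : devices) {
      char dev_name[256] = {0};
      clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(dev_name), dev_name, nullptr);
      std::printf("  Device: %s\n", dev_name);  // the RISC-V CPU should appear here
    }
  }
  return 0;
}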

Experimental results

Unfortunately, we were unable to build PoCL specifically for the LicheePI board, so launches on it were carried out using PoCL compiled for VisionFive.

AXPY

In most cases, the performance difference between OpenBLAS and CLBlast is small, no more than 15–20%. A significant gap appears only on the Radxa ROCK5 Model B board: the speedup of CLBlast relative to OpenBLAS can reach 85%.

GEMV

The GEMV results show that although at lower dimensions the difference in performance between CLBlast and OpenBLAS is not that large (10–15%), as the number of elements grows, OpenBLAS turns out to be 85–230% faster than CLBlast.

GEMM

At large dimensions, the performance of OpenBLAS is higher than that of CLBlast on PoCL: by 25% on VisionFive, 55% on VIM4, 110% on Radxa ROCK5 Model B, and 257% on LicheePI. It is also worth noting that as the number of matrix elements grows, the performance of OpenBLAS increases, while the performance of CLBlast, on the contrary, decreases.

Overall, we got the expected results. The more popular OpenBLAS library works better than the experimental and still rough path of running OpenCL BLAS functions on the CPU. But in some cases the lag is not that large even on a compute-bound task (GEMM on VisionFive 2), and on AXPY CLBlast was even faster. This is an argument in favor of developing OpenCL on the RISC-V architecture.

Conclusion

We verified that CLBlast shows high performance compared to clBLAS and OpenBLAS, and we were able to run CLBlast on a RISC-V CPU. On the CPU, CLBlast is most often noticeably slower than OpenBLAS, but on memory-bound tasks it stays at the same level or even significantly outperforms it. When measuring GEMM on the VisionFive 2 board, we found a lag of only 25%.

These results are encouraging. If an OpenCL implementation optimized for the RISC-V architecture appears, libraries based on it will be able to get closer to more popular solutions and even overtake them.
