PyCUDA or this code needs speedup

When writing and debugging code, the question sometimes arises “how can I increase the speed in Python”, because, as you know, the disadvantages of Python are:

Lower operating speed;
Higher memory consumption of written programs compared to similar code written in compiled languages (C or C++).

In this article, we will consider the PyCUDA library as an alternative to CUDA for C/C++. Let’s evaluate its capabilities and compare performance using a specific example, namely, we will implement the Harris algorithm for detecting corners in an image.

Today, we can process images of several million pixels in a fraction of a second, allowing security and traffic control systems to accurately navigate a rapidly changing situation. The Harris Corner Detection Algorithm is considered one of the main algorithms that allows you to mark special points on the image, the importance of which is estimated by a special function of pixel brightness.

The Harris Corner Detector (or Harris Corner Detector) is a corner detection operator commonly used in computer vision algorithms. This algorithm was first introduced by C. Harris and M. Stevens in 1988 and since then it has been improved and adopted into various algorithms for image preprocessing.

There are also many variations and additions to the Harris algorithm, such as the Shih-Tomasi corner detection algorithm, the Trajkovic-Hadley detector, and the Moravec operator. Most of them perform a brightness estimation function with some window and an activation function. Such algorithms require pre-processing of the image, such as Gaussian blur. However, none of the algorithms take into account the peculiarities of CUDA, but all of them can be easily parallelized on GPUs, since each pixel is processed separately, but it needs values from neighboring pixels.

In this article, we will use Google Colaboratory, since in the framework of the article we want to check the speed of work on the CPU and GPU, respectively. Let’s check which CPUs and GPUs will be given to us in Google Colaboratory:

First, consider the implementation of the Harris detector on the CPU. The code for Harris’ algorithm would look like this:

start = time.time()

filename="/content/drive/MyDrive/image/art_4.png"
img = cv2.imread(filename)
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)

gray = np.float32(gray)
dst = cv2.cornerHarris(gray,2,3,0.04)

#result is dilated for marking the corners, not important
dst = cv2.dilate(dst,None)

# Threshold for an optimal value, it may vary depending on the image.
img[dst>0.01*dst.max()]=[0,0,255]

print('Результат алгоритма:')
cv2_imshow(img)
if cv2.waitKey(0) & 0xff == 27:
    cv2.destroyAllWindows()

end  = time.time() - start
print('Время вычисления на CPU: {:2.4f}'.format(end))

After executing this block of code, we get the result of the algorithm and the execution time of this algorithm on the CPU:

Note that the execution speed was almost one and a half seconds, which is not bad for code using OpenCV. But is it possible to manipulate the code to achieve speedup?

Let’s consider this possibility using the example of the implementation of the same code, but using the PyCUDA library.

Let’s implement the Harris algorithm for corner detection and parallelize the processes on the GPU to improve code performance.

def Harris_GPU(img, k, thresh):

    height = img.shape[0]
    width = img.shape[1]

    vector_size = img.shape[0] * img.shape[1]
    corner_list = []
    offset = 2
    # to fit still in a 32-bit integer
    thresh = int(thresh/10)

    # function template
    func_mod_template = Template("""
    #include<stdio.h>
    #define INDEX(a, b) a*${HEIGHT}+b

    __global__ void corners(
        float *dest,
        float *ixx,
        float *ixy,
        float *iyy,
        int offset,
        float k,
        int threshold) {

        unsigned int idx = threadIdx.x + threadIdx.y*blockDim.y +
                            (blockIdx.x*(blockDim.x*blockDim.y));

        unsigned int a = idx/${HEIGHT};
        unsigned int b = idx%${HEIGHT};

        float sxx = 0;
        float sxy = 0;
        float syy = 0;
        float det = 0;
        float trace = 0;
        float r = 0;

        if ((a >= offset) & (a <= (${WIDTH}-offset - 1)) &
            (b >= offset) & (b <= (${HEIGHT}-offset - 1))) {
            for (int bi = b - offset; bi < b + offset + 1; ++bi) {
                for (int ai = a - offset; ai < a + offset + 1; ++ai) {
                    sxx = sxx + ixx[INDEX(ai, bi)];
                    sxy = sxy + ixy[INDEX(ai, bi)];
                    syy = syy + iyy[INDEX(ai, bi)];
                }
            }
            det = sxx*syy - sxy*sxy;
            trace = sxx + syy;
            r = det - k*(trace*trace);
            if ((r/10) > threshold)
                dest[INDEX(a, b)] = r;
        }
    }
    """)

    func_mod = SourceModule(func_mod_template.substitute(HEIGHT=height,
                                                         WIDTH=width))
    pycuda_corners = func_mod.get_function("corners")

    # Find x and y derivatives
    dy, dx = np.gradient(img)
    Ixx = dx**2
    Ixy = dy*dx
    Iyy = dy**2

    ixx = Ixx.reshape(vector_size, order="F")
    ixy = Ixy.reshape(vector_size, order="F")
    iyy = Iyy.reshape(vector_size, order="F")
    dest_r = np.zeros_like(ixx)

    # start timer
    start = timeit.default_timer()

    pycuda_corners(drv.Out(dest_r),
                drv.In(ixx),
                drv.In(ixy),
                drv.In(iyy),
                np.uint32(offset),
                np.float32(k),
                np.uint32(thresh),
                block=(32, 32, 1),  
                grid=(height*width//1024, 1, 1))  

    # calculate used time
    execution_time = timeit.default_timer() - start

    # extract the corners
    r = np.reshape(dest_r, (height, width), order="F")
    corners = np.where(r > 0)
    for i, j in zip(corners[0], corners[1]):
        corner_list.append([j, i, r[i, j]])

    return corner_list, execution_time

Note that part of the code borrows the syntax of C/C++, which allows us to speed up the work of our code written in Python.

We get the result and the running time of the code on the GPU:

I would like to note that the implementation on the GPU does not detect all “special” points, and the whole process loops mainly on the left side of the picture, when, in turn, the implementation on the CPU detects more corners in the image. Nevertheless, the computation speed on the GPU is faster, namely, almost two times faster than on the CPU.

As a conclusion, we note that for the implementation of large and “heavy” projects or tasks where speed is important and necessary, using the PyCUDA library will be a plus (again, if you know C / C ++).

I will add that PyCUDA is an open source library and licensed by the MIT. The documentation is clear and understandable for any user, and it also has many publicly available code samples that can provide assistance and support. The main purpose of using PyCUDA is to allow the user to apply the necessary CUDA templates with a minimum of Python abstractions.