Optimizing preprocessing and postprocessing of the Yolov5 neural network using CUDA, Thrust and Nvidia Performance Primitives



A little backstory: at our company, we decided to build a boxed license plate recognition product (for opening barriers, registering passing cars, and so on). The single-board Jetson Nano was chosen as the platform.

As an experiment, we ported our license plate recognition, written in Python and running in the cloud, to the Jetson Nano. The results were unsatisfactory: the CPU was almost fully loaded, because OpenCV, used for preprocessing, ran on the CPU, and the post-processing of the Yolo model also ran on the CPU.

The negative points can also include:

  • Video decoding on the CPU instead of the hardware decoder built into the Jetson Nano

  • Constant copying of data from GPU memory to CPU memory (although the Jetson Nano shares its RAM between the GPU and the CPU)

  • An idle GPU – 128 CUDA cores sat around doing nothing while the 4 ARM CPU cores were loaded to capacity

We decided to rewrite it all in C++, writing code as optimized and as close to the Jetson Nano hardware as possible, generally relying on software solutions from Nvidia.

Luckily for us, Nvidia provides a library with all sorts of utilities for the Jetson Nano (jetson-utils). Thanks to it, you can create a video decoder simply by specifying the URL of an RTSP stream, and receive in response a queue of frames decoded by the hardware decoder directly into CUDA video memory.

// method that creates the decoder
gstDecoder* gstDecoder::Create( const videoOptions& options )

// writes the address of the frame in video memory into output
bool gstDecoder::Capture( void** output,
                          imageFormat format,
                          uint64_t timeout )
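
A minimal usage sketch; the stream URL and the capture loop are my illustration of this API, not the project's actual code:

// Sketch: open an RTSP stream and pull frames that already live in GPU memory.
videoOptions options;
options.resource = "rtsp://192.168.0.10:554/stream";    // hypothetical camera URL

gstDecoder* decoder = gstDecoder::Create(options);

void* frame = NULL;                                     // frame in CUDA memory
while (decoder->Capture(&frame, IMAGE_RGB32F, 1000))    // 1000 ms timeout
{
    // hand the frame straight to preprocessing, without any CPU copy
}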

Preprocessing

We have received the frame, but what do we do next? Yolov5s expects a 640 × 640 normalized image transposed from HWC to CHW, so the image requires preprocessing:

  1. Resizing and adding a border (to keep the proportions at 640 × 640; besides, Yolo itself was trained on letterboxed images)

  2. Converting each pixel from char to float32

  3. Normalization (so that the values of each pixel lie between 0 and 1.0 inclusive)

  4. Transpose HWC => CHW

To solve this problem, we decided to use the Nvidia Performance Primitives library, or simply NPP. I will briefly describe how it works. Suppose we need to resize a three-channel (RGB) image whose pixels are stored as float32; then we take the function nppiResize_32f_C3R(). The function name decodes like this:

  • 32f in the name says that each pixel of the image is stored in float32

  • C3 — that the image has 3 channels

  • R — that the function operates on a specified ROI (region of interest)

The function takes the following arguments (a sketch of a complete call follows the list):

  • const Npp32f *pSrc – pointer to the beginning of the source image

  • int nSrcStep – row step of the source image in bytes

  • NppiSize oSrcSize – width and height of the source image

  • NppiRect oSrcRectROI – ROI rectangle of the source image in the format {x offset, y offset, width, height}

  • Npp32f *pDst – pointer to the beginning of the new, resized image

  • int nDstStep – row step of the new image in bytes

  • NppiSize oDstSize – width and height of the new image

  • NppiRect oDstRectROI – ROI rectangle of the new image

  • int eInterpolation – interpolation mode
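
To make the argument list concrete, here is a minimal sketch of a complete resize call; the sizes and the device pointers srcDev and dstDev are illustrative assumptions, not the project's code:

// Sketch: shrink a 1920x1080 float32 RGB image to 640x360
// (640x360 keeps the 16:9 proportions; the border to 640x640 comes later).
NppiSize srcSize { 1920, 1080 };
NppiRect srcROI  { 0, 0, srcSize.width, srcSize.height };
NppiSize dstSize { 640, 360 };
NppiRect dstROI  { 0, 0, dstSize.width, dstSize.height };

// row step = width * channels * sizeof(pixel)
int srcStep = srcSize.width * 3 * sizeof(Npp32f);
int dstStep = dstSize.width * 3 * sizeof(Npp32f);

NppStatus st = nppiResize_32f_C3R( srcDev, srcStep, srcSize, srcROI,
                                   dstDev, dstStep, dstSize, dstROI,
                                   NPPI_INTER_LINEAR );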

Working with NPP is not that simple: if you make a mistake in the arguments, you get either garbage instead of a new picture, or a non-zero status code on return. With the first, you have to sit and figure out where you went wrong. The second is simpler – an enum with errors for every occasion tells you exactly what is wrong.

To add the computed border we used nppiCopyConstBorder_32f_C3R, and to normalize the image, nppiDivC_32f_C3IR. The transposition took a couple of days of thinking about how best to do it; nppiCopy_32f_C3P3R fit perfectly:

// Split into three planes and copy them sequentially straight into the
// Yolov5 buffer at offsets of one plane each,
// thereby transposing into the required layout (CHW).
NppStatus st;
float * const inputArr[3] { this->yoloInput,
                            this->yoloInput + YOLO_PLANAR_OFFSET,
                            this->yoloInput + (YOLO_PLANAR_OFFSET * 2) };

int planarStep = YOLO_SIZE * 1 * sizeof(IMAGE_TYPE);

st = nppiCopy_32f_C3P3R( (Npp32f*)this->yoloBuffer,
                         this->yoloConstStep,
                         inputArr,
                         planarStep,
                         yoloSize );
return st;
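
For completeness, a sketch of what the border and normalization calls mentioned above could look like, continuing the illustrative 640 × 360 example; the buffers, the 140-pixel offset and the gray border value of 114 (Yolov5's usual letterbox color) are assumptions rather than the project's constants:

// 1. Letterbox: pad the 640x360 image to 640x640 with a constant border
//    (140 px on top; the remaining 140 px end up on the bottom).
NppiSize innerSize  { 640, 360 };
NppiSize letterSize { 640, 640 };
const Npp32f borderColor[3] { 114.f, 114.f, 114.f };

NppStatus st = nppiCopyConstBorder_32f_C3R(
                   resizedDev, 640 * 3 * sizeof(Npp32f), innerSize,
                   boxedDev,   640 * 3 * sizeof(Npp32f), letterSize,
                   140 /* top */, 0 /* left */, borderColor );

// 2. Normalize in place: divide every channel by 255 => values in [0, 1].
const Npp32f divisor[3] { 255.f, 255.f, 255.f };
st = nppiDivC_32f_C3IR( divisor, boxedDev, 640 * 3 * sizeof(Npp32f), letterSize );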

I did not describe the conversion of the image to float32 for a simple reason: we do not do it ourselves. The library converts to float32 for us right at frame capture:

// format specifies the format in which we want to receive the image.
bool gstDecoder::Capture( void** output,
                          imageFormat format,
                          uint64_t timeout )

Regarding the performance of this NPP preprocessing: on the Jetson Nano the entire pipeline, from capture to writing into the TensorRT buffer, takes 40 milliseconds, which keeps up with the 25 frames per second coming from the camera; measuring the stages separately showed roughly 10–15 milliseconds each.

Next comes Yolov5s inference on TensorRT; our Yolov5 is trained on fewer classes than the standard model, and its inference takes about 140–160 milliseconds.

Postprocessing

At the output of Yolov5 we get:

  • 25200 coordinates of rectangles of supposedly recognized objects

  • 25200 confidence values for these rectangles

  • (25200 * Number of classes) class confidence values

In the code, the results are represented like this:

// x, y - coordinates of the rectangle's center
// w, h - width, height
// s - objectness confidence
// cs - confidence for each class
// NUM_CLS_SCORES - number of Yolov5 classes

struct YoloOut
{
    float x;
    float y;
    float w;
    float h;
    float s;
    float cs[NUM_CLS_SCORES];
};
// The output is 25200 YoloOut structures

Yolov5 produces so many rectangles that post-processing is required: throw out the rectangles that do not pass the confidence threshold, keep only the highest-confidence ones among those remaining, and weed out the rectangles that heavily overlap them but have lower confidence.

Picture taken from https://www.researchgate.net/figure/Non-maximal-suppression-left-all-detections-right-fused-detections_fig2_228881235

For this screening there is a dedicated algorithm, non-maximum suppression (NMS). Ready-made open-source implementations exist, for example cv::dnn::NMSBoxes() in OpenCV, but they run on the CPU and therefore did not suit us: first we would have to copy the results into the CPU part of RAM (TensorRT did not want to use memory allocated for both the CPU and the GPU at once), and then do all the filtering on the CPU. So it was decided to write our own NMS on CUDA that would run fast and leave the CPU alone.

I had to figure out how CUDA works on the go: the grid, blocks, threads, parallelism, restrictions, qualifiers, and so on. For the 25200 rectangles, the plan was to keep another 25200 bool values recording whether each element has been filtered out or not.

The code is commercial, and officially publishing it would require bureaucracy and other people's permission, so I am publishing only the headers.

// Split-copy the rectangles, the confidence and the object class
// into separate arrays (probably the least memory-optimized spot)
__global__
void splitBoxes( const YoloOut * yoloOut,
                 Box * box,
                 objConf_t * objectConfidence,
                 ClsConf * cls )   // `class` is a C++ keyword, hence `cls`

// The simplest possible parallelism - 25200 blocks in the grid,
// one thread per block; the arguments are
// pointers to the beginnings of the buffers

splitBoxes <<<25200,1>>> ( this->yoloOut,
                           this->devBoxes,
                           this->objC,
                           this->clsConf );


// filter by confidence - if below the threshold, set the status to false
__global__
void filterObjConf( const objConf_t * objectConfidence,
                    boxStatus_t * boxStatus );

// parallelized in exactly the same way
filterObjConf <<<25200,1>>> ( this->objC,
                              this->boxStatus );
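
Since only the headers are public, here is a hypothetical reconstruction of what such kernel bodies could look like under this launch configuration; the Box field layout and the CONF_THRESHOLD constant are my assumptions, not the project's code:

// Hypothetical bodies: with <<<25200,1>>> each block processes exactly one
// element, indexed by blockIdx.x.
__global__
void splitBoxes( const YoloOut * yoloOut, Box * box,
                 objConf_t * objectConfidence, ClsConf * cls )
{
    const int i = blockIdx.x;
    box[i] = Box { yoloOut[i].x, yoloOut[i].y, yoloOut[i].w, yoloOut[i].h };
    objectConfidence[i] = yoloOut[i].s;
    for (int c = 0; c < NUM_CLS_SCORES; ++c)
        cls[i].cs[c] = yoloOut[i].cs[c];
}

__global__
void filterObjConf( const objConf_t * objectConfidence,
                    boxStatus_t * boxStatus )
{
    const int i = blockIdx.x;
    boxStatus[i] = (objectConfidence[i] >= CONF_THRESHOLD);   // e.g. 0.25f
}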

What makes CUDA convenient, in my opinion, is that you just specify how to parallelize the work and CUDA takes care of the rest: it distributes the computation of the 25200 elements across the Jetson Nano's 128 CUDA cores.

Before NMS, in addition to filtering by the confidence threshold, we need to multiply the objectness confidence by the maximum class confidence and store the id of the class with that maximum:

// ci - separately store the id of the class with the maximum confidence
__global__
void mulObjAndCls( objConf_t * oc,
                   const ClsConf * cs,
                   clsIdx_t * ci,
                   const boxStatus_t * bs ) // skip elements already rejected

mulObjAndCls <<<25200,1>>> ( this->objC,
                             this->clsConf,
                             this->clsId,
                             this->boxStatus );
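
The body of this kernel is private too, so the following is only a guess at the obvious implementation (argmax over the class scores, then fusing it with the objectness):

// Hypothetical body: find the best class score, multiply it into the
// objectness and remember the winning class index; already rejected
// elements are skipped.
__global__
void mulObjAndCls( objConf_t * oc, const ClsConf * cs,
                   clsIdx_t * ci, const boxStatus_t * bs )
{
    const int i = blockIdx.x;
    if (!bs[i])
        return;

    float best  = cs[i].cs[0];
    int bestIdx = 0;
    for (int c = 1; c < NUM_CLS_SCORES; ++c)
        if (cs[i].cs[c] > best) { best = cs[i].cs[c]; bestIdx = c; }

    oc[i] *= best;      // final confidence = objectness * max class score
    ci[i]  = bestIdx;
}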

After filtering by the confidence threshold, no more than 1000 elements remain (usually about 100), and they need to be compactly extracted from the 25200-element YoloOut array using the 25200 bool keys. We would never have managed to write an efficient parallel stream-compaction algorithm on CUDA ourselves, so the Thrust library helped us a lot in writing NMS: compacting the results with thrust::copy_if() works really fast – measurements showed 1–3 milliseconds.

thrust::copy_if(objCTh,
                objCTh + YOLO_OUT_SIZE,
                boxStatusTh,
                this->objCThResult.begin(),
                is_true());
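
The is_true predicate and the iterators are not shown in the article; presumably they are a trivial device functor over the status stencil and thrust::device_ptr wrappers over the raw CUDA buffers, roughly like this:

// Assumed definitions - not from the original code.
struct is_true
{
    __host__ __device__
    bool operator()(const boxStatus_t s) const { return s; }
};

// Wrapping raw device pointers so Thrust can iterate over them:
thrust::device_ptr<objConf_t>   objCTh(this->objC);
thrust::device_ptr<boxStatus_t> boxStatusTh(this->boxStatus);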

After compaction comes the NMS algorithm itself: all the remaining rectangles are compared pairwise, and whenever the intersection-over-union (IoU) ratio exceeds the threshold, the rejection flag is set; then thrust::copy_if() performs the final compaction and writes out the results.

#define BLOCKSIZE 32
normIteratorDevPtr<Box> boxesR = thrust::copy_if(devBoxesTh,
                                                 devBoxesTh + YOLO_OUT_SIZE,
                                                 boxStatusTh,
                                                 this->boxesThResult.begin(),
                                                 is_true());

// resultSize is the number of rectangles that passed filtering, before NMS
int resultSize = boxesR - boxesThResult.begin();

dim3 gridSize(int(ceil(float(resultSize)/BLOCKSIZE)),
              int(ceil(float(resultSize)/BLOCKSIZE)),1);

dim3 blockSize(BLOCKSIZE, BLOCKSIZE, 1);

_nms <<< gridSize, blockSize >>>(boxesRawPtr, boxStRawPtr);

normIteratorDevPtr<Box> resultNms =
    thrust::copy_if( this->boxesThResult.begin(),
                     this->boxesThResult.end(),
                     this->boxStatusNmsResult.begin(),
                     this->boxesAfterNmsResult.begin(),
                     is_true() );

resultSize = resultNms - this->boxesAfterNmsResult.begin();
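
The body of _nms is private as well; below is a sketch of the pairwise test such a kernel typically performs. The centre-based Box layout, the separate confidence array, the explicit size parameter and the IOU_THRESHOLD constant are all assumptions: the published call passes only two pointers, so the real kernel evidently obtains the rest some other way.

// Hypothetical pairwise NMS: thread (ix, iy) compares two surviving boxes;
// of every pair that overlaps above the threshold, the lower-confidence
// box is flagged as suppressed.
__device__ float iou( const Box & a, const Box & b )
{
    const float left   = fmaxf(a.x - a.w / 2, b.x - b.w / 2);
    const float right  = fminf(a.x + a.w / 2, b.x + b.w / 2);
    const float top    = fmaxf(a.y - a.h / 2, b.y - b.h / 2);
    const float bottom = fminf(a.y + a.h / 2, b.y + b.h / 2);
    const float inter  = fmaxf(right - left, 0.f) * fmaxf(bottom - top, 0.f);
    return inter / (a.w * a.h + b.w * b.h - inter);
}

__global__
void nmsSketch( const Box * boxes, const float * conf,
                boxStatus_t * status, int n )
{
    const int ix = blockIdx.x * blockDim.x + threadIdx.x;
    const int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= n || iy >= n || ix == iy)
        return;
    if (iou(boxes[ix], boxes[iy]) > IOU_THRESHOLD && conf[iy] > conf[ix])
        status[ix] = false;
}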

Conclusion

After these optimizations, the CPU load dropped to 10–15%, and the GPU is utilized at 99%.

Thank you for your attention.
