Launching instance segmentation on a Chinese single-board computer

This is the story of launching YOLACT on a RockChip edge device. Since the launch process took longer than I expected, I decided to share my work to help other developers who may face the same problem. In the end I found a way to run yolact that achieves good performance and model quality. I hope my experience will be useful to you and help you avoid the mistakes I made. Enjoy reading!

It just so happens that many articles, reviews and manuals are devoted to the problems of image classification, detection and segmentation of objects in images. The topic of computer vision algorithms is interesting in general. At the same time, when it comes to instance segmentation, authors usually limit themselves to an overview of Mask R-CNN and Fast R-CNN, networks that came out more than seven years ago. Running these networks in real time is essentially impossible even on powerful hardware, let alone on single-board devices. Trying to fill this gap, and having a powerful and cheap Firefly at hand, I came across yolact.

A little about yolact itself

The article “YOLACT: Real-time Instance Segmentation” came out in 2019. At that time, the network was four times faster than other algorithms solving a similar problem (33 fps versus 8). At the same time, the accuracy (mAP) of yolact was about 17 percent lower than that of the famous Mask R-CNN. The authors state that yolact is the first instance segmentation network working in “real time”, which is very, very impressive.

It must be said that instance segmentation is the most interesting segmentation task, since it most closely mirrors the process that occurs in our brain when we look at surrounding objects: we see not only the boundaries of objects, but also understand which class these objects belong to. Yolact does this using a “pyramid of convolutional layers” (FPN) that extracts feature maps of different scales, and a Protonet that generates k prototype masks (k = 32); the final masks are assembled as linear combinations of these prototypes. As the authors write, the network architecture is based on RetinaNet + FPN.
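To make the mask assembly concrete, here is a minimal numpy sketch of the idea from the paper (shapes and names are illustrative, not code from any of the repositories discussed below): Protonet outputs k prototype masks, the prediction head outputs k coefficients per detection, and the final mask is a sigmoid of their linear combination.

import numpy as np

def assemble_mask(prototypes, coeffs):
    # prototypes: (H, W, k) prototype masks from Protonet
    # coeffs: (k,) per-detection mask coefficients from the prediction head
    lin = prototypes @ coeffs          # linear combination -> (H, W)
    return 1.0 / (1.0 + np.exp(-lin))  # sigmoid -> soft mask in [0, 1]

# illustrative shapes: 138x138 prototypes, k = 32
mask = assemble_mask(np.random.rand(138, 138, 32), np.random.rand(32))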

A little about RockChip

The Firefly RK3588 has an onboard neural processor (NPU) with three cores and a capacity of up to 6 TOPS. To use its full potential (run inference on the NPU), you need to convert the model into a specific format (graph), rknn, and compress (quantize) the model weights to 16- or 8-bit precision. This can be done with a small script, an example of which is in the official firefly repositories. Conversion to rknn format can be done directly from model.pt and also from model.onnx.
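For reference, a minimal conversion sketch using rknn-toolkit2 might look like the following (the file names, mean/std values and the calibration list are assumptions for illustration; check the script in the official repositories for the exact parameters):

from rknn.api import RKNN

rknn = RKNN()
# normalization must match the one used during training (values here are assumed)
rknn.config(mean_values=[[103.94, 116.78, 123.68]],
            std_values=[[57.38, 57.12, 58.40]],
            target_platform='rk3588')
rknn.load_onnx(model='yolact_544.onnx')            # hypothetical file name
# dataset.txt: list of calibration images, needed only for int8 quantization
rknn.build(do_quantization=True, dataset='dataset.txt')
rknn.export_rknn('yolact_544.rknn')
rknn.release()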

Overview of yolact projects

dbolya. There are several yolact implementation repositories on the Internet. First of all, there is the repository from the authors of the original article: dbolya. The project is not user-friendly (despite the voluminous “Readme”), but it runs out of the box. The big disadvantage of this project is the inability to convert the model into onnx format. The author admits that he is not going to rewrite the model to add onnx export support, since that would reduce its performance. Converting directly to rknn is not possible either. Therefore, if you are going to run segmentation on a computer, server, etc., you can use this project and don’t need to read further.

Ma-Dan. Luckily, there is a fork of yolact from Ma-Dan, in which the “unnecessary” decorators and various jit compilers are removed from the model class, which allows you to convert it into onnx format. But the Ma-Dan model has a strange output (Fig. 1), which is why the conversion to the rknn format required to run on firefly does not work. This “output” can probably be removed so that everything works, but I have not checked this.

Fig. 1. A strange, unconnected (unlike the other four) network output.

PINTO. Another version of yolact can be found among the hundreds of other models from the author PINTO0309 (honestly, I don’t know what this man does in his free time, but I suspect that he lives on Venus, since there are ~5832 hours in a day there; how else to explain his number of “contributions”?). All you need is to clone the repository and download the models and post-processes prepared by the author (in onnx format) using the download.sh script. Congratulations, you are 90% closer to launching yolact on an edge device.

Fig. 2. Yolact inference results demonstrated by PINTO0309.

There are several points here. First, you need to convert the model (for example, “yolact_base_54_800000_550x550”) into rknn format. The numbers at the end of the model name indicate the size of the input image. Secondly, you need to convert “postprocess550x550” to rknn format. And doing this is not so simple – hello, “reshape 1×30963” layers. It’s easier to run the post-process using onnxruntime (a sketch of this is shown after the code below). Thirdly, what the author calls a post-process (postprocess550x550.onnx) is actually only part of the post-process. If you compare it with the original repository (dbolya/yolact), it turns out that this is part of the Detect() class, which produces four outputs: classes, boxes, scores, masks. Therefore, to get the result as in the picture (Fig. 2), you need to add the following lines:

post-process function
import numpy as np

threshold = 0.5  # confidence threshold (value assumed; tune to your needs)

def prep_display(results):
    def crop(bbox, shape):
        # convert a relative bbox into integer pixel slices, clipped at 0
        x1 = max(int(bbox[0] * shape[1]), 0)
        y1 = max(int(bbox[1] * shape[0]), 0)
        x2 = max(int(bbox[2] * shape[1]), 0)
        y2 = max(int(bbox[3] * shape[0]), 0)
        return (slice(y1, y2), slice(x1, x2))

    bboxes, scores, class_ids, masks = [], [], [], []
    for result, mask in zip(results[0][0], results[1]):
        bbox = result[:4].tolist()
        score = result[4]
        class_id = int(result[5])
        if threshold <= score:
            # binarize the mask and encode the class id into its pixels
            mask = np.where(mask > 0.5, class_id + 1, 0).astype(np.uint8)
            # zero out everything outside the detection's bounding box
            region = crop(bbox, mask.shape)
            cropped = np.zeros(mask.shape, dtype=np.uint8)
            cropped[region] = mask[region]
            bboxes.append(bbox)
            class_ids.append(class_id)
            scores.append(score)
            masks.append(cropped)
    return bboxes, scores, class_ids, masks

where “results” are the four tensors obtained as the output of postprocess550x550.
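For completeness, here is the onnxruntime sketch mentioned above for producing these four tensors (the input/output names, their ordering and the preprocessing are my assumptions; check the actual names with netron.app or session.get_inputs()):

import cv2
import numpy as np
import onnxruntime as ort

model = ort.InferenceSession("yolact_base_54_800000_550x550.onnx")
post = ort.InferenceSession("postprocess550x550.onnx")

img = cv2.imread("test.jpg")
blob = cv2.resize(img, (550, 550)).astype(np.float32)
blob = blob.transpose(2, 0, 1)[None]        # HWC -> NCHW (normalization omitted)

# run the network, then feed its raw outputs into the post-process graph,
# assuming the output order matches the post-process input order
raw = model.run(None, {model.get_inputs()[0].name: blob})
results = post.run(None, {inp.name: t for inp, t in zip(post.get_inputs(), raw)})

bboxes, scores, class_ids, masks = prep_display(results)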

And finally, draw the result:

function that draws masks and boxes on the image
import cv2
import numpy as np

def onnx_draw(frame, bboxes, scores, class_ids, masks):
    # get_colors, draw_box, COCO_CLASSES and MASK_SHAPE are helpers/constants
    # defined elsewhere in the project
    colors = get_colors(len(COCO_CLASSES))
    frame_height, frame_width = frame.shape[0], frame.shape[1]
    # Draw masks
    if len(masks) > 0:
        mask_image = np.zeros(MASK_SHAPE, dtype=np.uint8)
        for mask in masks:
            # look up a color per pixel by the class id encoded in the mask
            color_mask = np.array(colors, dtype=np.uint8)[mask]
            filled = np.nonzero(mask)
            mask_image[filled] = color_mask[filled]
        mask_image = cv2.resize(mask_image, (frame_width, frame_height),
                                interpolation=cv2.INTER_NEAREST)
        cv2.addWeighted(frame, 0.5, mask_image, 0.5, 0.0, frame)
    # Draw boxes
    for bbox, score, class_id in zip(bboxes, scores, class_ids):
        x1, y1 = int(bbox[0] * frame_width), int(bbox[1] * frame_height)
        x2, y2 = int(bbox[2] * frame_width), int(bbox[3] * frame_height)
        color = colors[class_id + 1]
        frame = draw_box(frame, (x1, y1, x2, y2), color, class_id, score)
    return frame

But there is a fourth, most unpleasant point: PINTO’s yolact_base_54_800000_550x550.onnx cannot be retrained. And even if you manage to do this, then if the number of classes differs from the original 80, postprocess550x550.onnx will stop working, since the size of its input tensor is fixed. PINTO writes that the size of the input tensors of onnx models can be changed using certain tools, but you don’t want to bother with this every time you need to train a model for new classes, instead of having a normally working post-process.
To summarize: if you are not going to retrain yolact for your own classes, but want to quickly launch a model for a demonstration, then you can get by with the PINTO repository.

postprocess550x550

In fact, this whole story with postprocess550x550.onnx was based on the assumption that it would be faster than a hand-written post-process using Python loops. This assumption turned out to be incorrect.

feiyuhuahuo. The last and final option was Yolact_minimal. The project itself, like the model, is a simplified version of the original dbolya/yolact project. Training the model, assessing its accuracy and converting it to the onnx format takes a few lines and some config changes. The model converts perfectly to fp16, and if you use resnet101 as a backbone, it can even be quantized to int8, although the accuracy after quantization leaves much to be desired. One option for increasing the accuracy of a quantized model is Quantization Aware Training.
The only negative is that to run inference on the device, you need to port the entire post-process there, and you can only use functions that work with numpy arrays. After this you can get some results.
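As an illustration of what “numpy only” means in practice, here is a minimal sketch of one typical post-process step, non-maximum suppression, written with nothing but numpy (my own sketch, not code from Yolact_minimal):

import numpy as np

def np_nms(boxes, scores, iou_thre=0.5):
    # boxes: (N, 4) as x1, y1, x2, y2; scores: (N,)
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the best remaining box with all the others
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thre]
    return keep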

We download a random picture from the Internet, crop it to the desired size and feed it to the network. Let’s look at the result:

Fig. 3. Result of the rknn model.

We see that the result is not yet satisfactory: a bunch of bboxes, masks overlapping each other, a strange “score”. Be that as it may, something has already appeared and we can work with it. A “score” greater than one suggests certain thoughts. Comparing the rknn model with the Yolact_minimal model, for example using netron.app (as it turned out, it happily “eats” rknn models), you can see that in the original network the output, before being filtered by “nms_score_thre”, goes through the softmax function (class_pred = F.softmax(class_pred, -1)). Clearly, the softmax was lost during conversion, so let’s add it back:

import numpy as np

def np_softmax(x):
    # x: (num_predictions, num_classes) array of raw class logits
    np_max = np.max(x, axis=1)
    sft_max = []
    for idx, pred in enumerate(x):
        # subtract the row maximum for numerical stability
        e_x = np.exp(pred - np_max[idx])
        sft_max.append(e_x / e_x.sum())
    return np.array(sft_max)

# Note that the softmax is computed in a loop,
# separately for each class vector (which is logical).
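A minimal usage sketch (the output index, shape and background-class convention are assumptions based on the original repository; verify them against your converted graph):

class_pred = outputs[1]                   # assumed: (num_anchors, 81) raw class logits
class_pred = np_softmax(class_pred)
scores = class_pred[:, 1:].max(axis=1)    # best non-background score per anchor
keep = scores > nms_score_thre            # nms_score_thre comes from the project config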

Then we get a nice result. And this is on a device the size of a plastic card!

Fig. 4. The result of the rknn model + softmax.

Everything works as it should, and now you can set a confidence threshold that poorly recognized objects must not cross.

All that remains is to train the model for your own tasks. How to do this is described in the Yolact_minimal repository. In short, the main thing is to get a JSON file with annotations in COCO format (custom_ann.json) and to insert the names of the required classes into the config.
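For reference, the COCO annotation format boils down to three linked lists; a minimal custom_ann.json has the following structure (shown here as a Python dict; the ids, file names and coordinates are illustrative):

custom_ann = {
    "images": [
        {"id": 1, "file_name": "img_0001.jpg", "width": 544, "height": 544}
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 150, 80, 60],          # x, y, width, height
         "segmentation": [[100, 150, 180, 150, 180, 210, 100, 210]],
         "area": 4800, "iscrowd": 0}
    ],
    "categories": [
        {"id": 1, "name": "my_class", "supercategory": "object"}
    ]
}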

Training on 50 pictures takes about 1.5 hours on a single V100 card. The network only works with images whose dimensions are a multiple of 32; I chose 544 by 544 px images.
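If your source images come in arbitrary sizes, a small helper like this (my own sketch) brings them to the nearest multiple of 32:

import cv2

def resize_to_multiple_of_32(img):
    # round each dimension to the nearest multiple of 32 (minimum 32)
    h = max(32, round(img.shape[0] / 32) * 32)
    w = max(32, round(img.shape[1] / 32) * 32)
    return cv2.resize(img, (w, h))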

In conclusion, I would like to emphasize that the use of neural networks for computer vision problems is becoming increasingly popular and in demand. My experience has shown that yolact running on the RK3588 is capable of processing video streams with high accuracy. In addition, thanks to the open source code, anyone can run the model themselves and customize it to their needs.

If you have any questions or suggestions, feel free to write them in the comments. I am always ready to share my experience and help in solving complex problems.

I also want to recommend my repository for launching yolact: GitHub. It contains two rknn networks, “yolact_550.rknn” and “yolact_544.rknn”, and describes two post-processes, ONNX and RKNN, for the first and second model respectively. Thank you for your attention!
