TensorRT 6.x.x.x – high-performance inference for deep learning models (Object Detection and Segmentation)

It only hurts the first time!

Hello, dear friends! In this article I want to share my experience of using TensorRT and RetinaNet, based on the repository github.com/aidonchuk/retinanet-examples (a fork of the official nvidia repo that lets you start using production-optimized models in the shortest possible time). Scrolling through the ods.ai community channels, I keep coming across questions about TensorRT, and they mostly repeat, so I decided to write as complete a guide as possible to fast inference based on TensorRT, RetinaNet, Unet, and docker.

Task description

I propose framing the task this way: we need to label a dataset, train a RetinaNet / Unet network on it with PyTorch 1.3+, convert the resulting weights to ONNX, then convert those to a TensorRT engine, and run the whole thing in docker, preferably on Ubuntu 18 and very preferably on the ARM (Jetson)* architecture, thereby minimizing manual environment setup. As a result, we get a container that is ready not only for exporting and training RetinaNet / Unet, but also for full development and training of classification and segmentation models with all the necessary tooling.

Stage 1. Preparation of the environment

It is important to note that I have recently stopped installing libraries directly on my desktop machine or on the devbox altogether. The only things you have to create and install are a python virtual environment and cuda 10.2 from deb (the nvidia driver alone is enough).

Suppose you have a freshly installed Ubuntu 18. Install cuda 10.2 (deb); I will not dwell on the installation process, as the official documentation is quite sufficient.

Now install docker. An installation guide is easy to find, for example www.digitalocean.com/community/tutorials/docker-ubuntu-18-04-1-en; version 19+ is already available, so install that. Do not forget to enable running docker without sudo; it is more convenient. Once that works, run:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

And you don’t even have to look into the official repository github.com/NVIDIA/nvidia-docker.

Now do git clone github.com/aidonchuk/retinanet-examples.

Only a little remains: to start using docker with the nvidia image, we need to register in NGC Cloud and log in. Go to ngc.nvidia.com, register, and once inside NGC Cloud press SETUP in the upper left corner of the screen or follow this link: ngc.nvidia.com/setup/api-key. Click “generate key.” I recommend saving the key; otherwise you will have to regenerate it on your next visit and repeat this operation every time you deploy on a new machine.


docker login nvcr.io
Username: $oauthtoken
Password: <the generated key>

Copy the username exactly as shown. And with that, consider the environment deployed!

Stage 2. Assembling the docker container

At the second stage we will build the docker image and get acquainted with its insides.
Go to the root folder of the retinanet-examples project and run

docker build --build-arg USER=$USER --build-arg UID=$UID --build-arg GID=$GID --build-arg PW=alex -t retinanet:latest retinanet/

We build the image with the current user passed into it. This is very useful if you write to a mounted VOLUME: files are created with the rights of the current user, otherwise they will belong to root and cause pain.
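One detail worth checking in the command above: bash defines `$UID` itself, but `$GID` is not a built-in variable, so the group id may be passed empty unless you export it. A minimal sketch, assuming you prefer to assemble the same command from Python; the helper name `docker_build_command` is hypothetical, while the flags and build arguments are the ones from the command above.

```python
import getpass
import os


def docker_build_command(password, user=None, tag="retinanet:latest", context="retinanet/"):
    """Assemble the `docker build` call with the current user's name and numeric ids."""
    build_args = {
        "USER": user or getpass.getuser(),
        "UID": os.getuid(),
        "GID": os.getgid(),  # read from the OS instead of relying on $GID
        "PW": password,
    }
    command = ["docker", "build"]
    for key, value in build_args.items():
        command += ["--build-arg", f"{key}={value}"]
    command += ["-t", tag, context]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the project root, which avoids any shell quoting issues.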

While the image is building, let’s explore the Dockerfile:

FROM nvcr.io/nvidia/pytorch:19.10-py3

ARG USER
ARG UID=1000
ARG GID=1000
ARG PW=alex
RUN useradd -m ${USER} --uid=${UID} && echo "${USER}:${PW}" | chpasswd

RUN apt-get -y update && apt-get -y upgrade && apt-get -y install curl && apt-get -y install wget && apt-get -y install git && apt-get -y install automake && apt-get install -y sudo && adduser ${USER} sudo
RUN pip install git+https://github.com/bonlime/pytorch-tools.git@master

COPY . retinanet/
RUN pip install --no-cache-dir -e retinanet/
RUN pip install /workspace/retinanet/extras/tensorrt-
RUN pip install tensorboardx
RUN pip install albumentations
RUN pip install setproctitle
RUN pip install paramiko
RUN pip install flask
RUN pip install mem_top
RUN pip install arrow
RUN pip install pycuda
RUN pip install torchvision
RUN pip install pretrainedmodels
RUN pip install efficientnet-pytorch
RUN pip install git+https://github.com/qubvel/segmentation_models.pytorch
RUN pip install pytorch_toolbelt

RUN chown -R ${USER}:${USER} retinanet/

RUN cd /workspace/retinanet/extras/cppapi && mkdir build && cd build && cmake -DCMAKE_CUDA_FLAGS="--expt-extended-lambda -std=c++14" .. && make && cd /workspace

RUN apt-get install -y openssh-server && apt install -y tmux && apt-get -y install bison flex && apt-cache search pcre && apt-get -y install net-tools && apt-get -y install nmap
RUN apt-get -y install libpcre3 libpcre3-dev && apt-get -y install iputils-ping

RUN mkdir /var/run/sshd
RUN echo 'root:pass' | chpasswd
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd

ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
CMD ["/usr/sbin/sshd", "-D"]

As you can see from the text, we install all our favorite libraries, compile retinanet, add basic tools for working comfortably in Ubuntu, and configure the openssh server. The first line inherits from the nvidia image that we logged in to NGC Cloud for; it contains PyTorch 1.3, TensorRT 6.x.x.x, and a bunch of libraries that let us compile the cpp source code of our detector.

Stage 3. Starting and debugging the docker container

Let’s move on to the main use case for the container and the development environment. To start, launch the container with nvidia docker:

docker run --gpus all --net=host -v /home/:/workspace/mounted_vol -d -P --rm --ipc=host -it retinanet:latest

Now the container is accessible over ssh at localhost. After a successful launch, open the project in PyCharm. Next, open

Settings->Project Interpreter->Add->Ssh Interpreter

[Screenshots: Step 1, Step 2, Step 3]

We select everything as in the screenshots,

Interpreter -> /opt/conda/bin/python

this is a symlink to Python 3.6, and

Sync folder -> /workspace/retinanet

Click Finish, wait for indexing to complete, and that’s it, the environment is ready to use!

IMPORTANT!!! Immediately after indexing, extract the compiled files for Retinanet from docker. In the context menu in the project root, select


One file, _C.so, and two folders, build and retinanet.egg-info, will appear


If your project looks like this, then the environment sees all the necessary files and we are ready to train RetinaNet.

Stage 4. Marking up the data and training the detector

For labeling I mainly use supervise.ly, a pleasant and convenient tool; a bunch of bugs were fixed recently and it behaves much better now.

Suppose you have labeled the dataset and downloaded it. You cannot feed it into our RetinaNet right away, because it is in Supervisely’s own format, so we need to convert it to COCO. The conversion tool is in:


Please note that the categories in the script are only an example and you need to insert your own (you do not need to add a background category)

categories = [{'id': 1, 'name': '1'},
              {'id': 2, 'name': '2'},
              {'id': 3, 'name': '3'},
              {'id': 4, 'name': '4'}]
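To make the target format concrete, here is a minimal sketch of how a COCO-style detection file is laid out. It is not the repository's conversion script and does not read Supervisely's export; the function name `to_coco` and the tuple layouts of its inputs are hypothetical and chosen only for illustration.

```python
import json


def to_coco(images, boxes, categories):
    """Build a COCO-style detection dict.

    images:     list of (file_name, width, height)
    boxes:      list of (file_name, category_id, x_min, y_min, x_max, y_max), in pixels
    categories: list of {'id': int, 'name': str}, without a background entry
    """
    coco = {"images": [], "annotations": [], "categories": list(categories)}
    image_ids = {}
    for image_id, (file_name, width, height) in enumerate(images, start=1):
        image_ids[file_name] = image_id
        coco["images"].append(
            {"id": image_id, "file_name": file_name, "width": width, "height": height}
        )
    for ann_id, (file_name, category_id, x0, y0, x1, y1) in enumerate(boxes, start=1):
        box_w, box_h = x1 - x0, y1 - y0
        coco["annotations"].append({
            "id": ann_id,
            "image_id": image_ids[file_name],
            "category_id": category_id,
            "bbox": [x0, y0, box_w, box_h],  # COCO boxes are [x, y, width, height]
            "area": box_w * box_h,
            "iscrowd": 0,
        })
    return coco
```

The result can be written with `json.dump(coco, open("train.json", "w"))` and passed to the `--annotations` parameter. Note that the corner coordinates are converted to the top-left corner plus width and height, which is what COCO expects.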

For some reason, the authors of the original repository decided that you will not train anything except COCO / VOC for detection, so I had to slightly modify the source file


There I added my favorite augmentations from albumentations.readthedocs.io/en/latest and cut out the hard-coded COCO categories. It is also possible to crop large detection regions if you are looking for small objects in large pictures, have a small dataset =), and nothing else works, but more on that another time.
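One common way to handle small objects in large pictures is to cut each picture into overlapping tiles and run the detector on every tile. The article does not show its own approach, so the following is only a generic sketch; the function names and the default tile size and overlap are assumptions, not taken from the repository.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles of size `tile` that cover [0, length)."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # the last tile sits flush with the border
    return origins


def tiles(width, height, tile=512, overlap=64):
    """Return (x_min, y_min, x_max, y_max) crop boxes covering the whole image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]
```

Each box can be used to crop the image and to shift the ground-truth boxes by `(x_min, y_min)`. Detections from a tile are shifted back by the same offset before merging.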

The training loop was also weak: initially it did not save checkpoints, used some awful scheduler, and so on. But now all you have to do is select the backbone and execute

/opt/conda/bin/python retinanet/main.py

with parameters:

train retinanet_rn34fpn.pth
--backbone ResNet34FPN
--classes 12
--val-iters 10
--images /workspace/mounted_vol/dataset/train/images
--annotations /workspace/mounted_vol/dataset/train_12_class.json
--val-images /workspace/mounted_vol/dataset/test/images_small
--val-annotations /workspace/mounted_vol/dataset/val_10_class_cropped.json
--jitter 256 512
--max-size 512
--batch 32

In the console you will see output similar to this (the values depend on your backbone, dataset, and hardware):

Initializing model...
     model: RetinaNet
  backbone: ResNet18FPN
   classes: 2, anchors: 9
Selected optimization level O0:  Pure FP32 training.

Defaults for this optimization level are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 128.0
Preparing dataset...
    loader: pytorch
    resize: [1024, 1280], max: 1280
    device: 4 gpus
    batch: 4, precision: mixed
Training model for 20000 iterations...
[    1/20000] focal loss: 0.95619, box loss: 0.51584, 4.042s/4-batch (fw: 0.698s, bw: 0.459s), 1.0 im/s, lr: 0.0001
[   12/20000] focal loss: 0.76191, box loss: 0.31794, 0.187s/4-batch (fw: 0.055s, bw: 0.133s), 21.4 im/s, lr: 0.0001
[   24/20000] focal loss: 0.65036, box loss: 0.30269, 0.173s/4-batch (fw: 0.045s, bw: 0.128s), 23.1 im/s, lr: 0.0001
[   36/20000] focal loss: 0.46425, box loss: 0.23141, 0.178s/4-batch (fw: 0.047s, bw: 0.131s), 22.4 im/s, lr: 0.0001
[   48/20000] focal loss: 0.45115, box loss: 0.23505, 0.180s/4-batch (fw: 0.047s, bw: 0.133s), 22.2 im/s, lr: 0.0001
[   59/20000] focal loss: 0.38958, box loss: 0.25373, 0.184s/4-batch (fw: 0.049s, bw: 0.134s), 21.8 im/s, lr: 0.0001
[   71/20000] focal loss: 0.37733, box loss: 0.23988, 0.174s/4-batch (fw: 0.049s, bw: 0.125s), 22.9 im/s, lr: 0.0001
[   83/20000] focal loss: 0.39514, box loss: 0.23878, 0.181s/4-batch (fw: 0.048s, bw: 0.133s), 22.1 im/s, lr: 0.0001
[   94/20000] focal loss: 0.39947, box loss: 0.23817, 0.185s/4-batch (fw: 0.050s, bw: 0.134s), 21.6 im/s, lr: 0.0001
[  105/20000] focal loss: 0.37343, box loss: 0.20238, 0.182s/4-batch (fw: 0.048s, bw: 0.134s), 22.0 im/s, lr: 0.0001
[  116/20000] focal loss: 0.19689, box loss: 0.17371, 0.183s/4-batch (fw: 0.050s, bw: 0.132s), 21.8 im/s, lr: 0.0001
[  128/20000] focal loss: 0.20368, box loss: 0.16538, 0.178s/4-batch (fw: 0.046s, bw: 0.131s), 22.5 im/s, lr: 0.0001
[  140/20000] focal loss: 0.22763, box loss: 0.15772, 0.176s/4-batch (fw: 0.050s, bw: 0.126s), 22.7 im/s, lr: 0.0001
[  148/20000] focal loss: 0.21997, box loss: 0.18400, 0.585s/4-batch (fw: 0.047s, bw: 0.144s), 6.8 im/s, lr: 0.0001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.52674
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.91450
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.35172
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.61881
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.00000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.00000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.58824
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.61765
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.61765
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.61765
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.00000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.00000
Saving model: 148
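If you want to plot the losses from a log like this, the progress lines are regular enough to parse. Below is a minimal sketch, assuming the line format stays exactly as in the output above; the regular expression and the function name `parse_log_line` are my own and are not part of the repository.

```python
import re

# Matches progress lines such as:
# [   12/20000] focal loss: 0.76191, box loss: 0.31794, ... 21.4 im/s, lr: 0.0001
LOG_LINE = re.compile(
    r"\[\s*(?P<iteration>\d+)/(?P<total>\d+)\]\s+"
    r"focal loss: (?P<focal>[\d.]+), box loss: (?P<box>[\d.]+)"
    r".*?(?P<speed>[\d.]+) im/s, lr: (?P<lr>[\d.eE+-]+)"
)


def parse_log_line(line):
    """Return the numbers from one progress line, or None for any other line."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "iteration": int(match["iteration"]),
        "total": int(match["total"]),
        "focal_loss": float(match["focal"]),
        "box_loss": float(match["box"]),
        "images_per_second": float(match["speed"]),
        "lr": float(match["lr"]),
    }
```

Lines that are not progress lines, such as the AP/AR table or `Saving model: 148`, return `None` and can simply be skipped.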

To study the entire set of parameters look


In general, they are standard for detection, and they have a description. Run the training and wait for the results. An example of inference can be found in:


or execute the command:

/opt/conda/bin/python retinanet/main.py infer retinanet_rn34fpn.pth 
--images /workspace/mounted_vol/dataset/test/images 
--annotations /workspace/mounted_vol/dataset/val.json 
--output result.json 
--resize 256 
--max-size 512 
--batch 32
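The layout of result.json is not shown in this article. The sketch below assumes COCO-style detection records (each with `image_id`, `category_id`, `bbox`, and `score`), stored either as a bare list or under an `annotations` key; check your own file before relying on it. The function names are hypothetical.

```python
import json


def extract_records(data):
    """Return the list of detection records from parsed JSON (assumed layout)."""
    if isinstance(data, dict):
        return data.get("annotations", [])
    return list(data)


def filter_by_score(records, threshold=0.5):
    """Keep only detections whose confidence is at least `threshold`."""
    return [record for record in records if record.get("score", 0.0) >= threshold]


def load_confident_detections(path, threshold=0.5):
    """Read a results file and drop low-confidence detections."""
    with open(path) as handle:
        return filter_by_score(extract_records(json.load(handle)), threshold)
```

For example, `load_confident_detections("result.json", 0.3)` would return only the records scoring 0.3 or higher, under the layout assumption stated above.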

Focal Loss and several backbones are already built into the repository, and their


The authors give some characteristics in the table:


There are also the backbones ResNeXt50_32x4dFPN and ResNeXt101_32x8dFPN, taken from torchvision.
I hope we’ve sorted out the detection a bit, but you should definitely read the official documentation to understand export and logging modes.

Stage 5. Export and inference of Unet models with Resnet encoder

As you probably noticed, the Dockerfile installs libraries for segmentation, in particular the wonderful github.com/qubvel/segmentation_models.pytorch. In the unet package you can find examples of inference and of exporting pytorch checkpoints to a TensorRT engine.

The main problem when exporting Unet-like models from ONNX to TensorRT is the need to set a fixed Upsample size or use ConvTranspose2D:

import torch
import torch.onnx.symbolic_opset9 as onnx_symbolic


def upsample_nearest2d(g, input, output_size):
    # The TRT 5.1/6.0 ONNX parser does not support all ONNX ops
    # needed for the dynamic upsampling ONNX formulation.
    # Here we hardcode scale=2 as a temporary workaround.
    scales = g.op("Constant", value_t=torch.tensor([1., 1., 2., 2.]))
    return g.op("Upsample", input, scales, mode_s="nearest")


# Override the opset 9 symbolic so the export emits the fixed-scale op.
onnx_symbolic.upsample_nearest2d = upsample_nearest2d

With this override in place, the fixed-scale Upsample is substituted automatically during export to ONNX. TensorRT 7 has already solved this problem, so the wait turned out to be short.
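To make the effect of the hard-coded scales `[1, 1, 2, 2]` concrete: batch and channel axes keep their size, while height and width double, with every value repeated. The following is a plain-Python illustration of that behaviour on a single 2D feature map, not TensorRT or ONNX code; the function names are hypothetical.

```python
import math


def upsampled_shape(shape, scales):
    """Output shape of a fixed-scale upsample: floor(dim * scale) per axis."""
    return [math.floor(dim * scale) for dim, scale in zip(shape, scales)]


def upsample_nearest_2x(feature_map):
    """Nearest-neighbour upsampling of a 2D list by a factor of 2 on both axes."""
    return [
        [value for value in row for _ in range(2)]  # repeat each column twice
        for row in feature_map
        for _ in range(2)  # repeat each row twice
    ]
```

Because the scale is a constant rather than something computed from tensor shapes at run time, the exported graph contains no dynamic shape arithmetic, which is exactly what the workaround is after.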


When I started using docker, I had doubts about its performance for my tasks. In one of my units, there is now quite a lot of network traffic created by several cameras.


Various tests on the Internet showed a relatively large overhead for network communication and for writing to a VOLUME, plus the unknown and terrible GIL. Since capturing a frame, running the driver, and transmitting the frame over the network are atomic operations in hard real-time mode, network latency is critical for me.

But it all turned out fine =)

P.S. All that remains is to add your favorite training loop for segmentation and ship it to production!


Thanks to the ods.ai community; it is impossible to develop without it! Many thanks to n01z3, who pushed me to take up DL, for his invaluable advice and extraordinary professionalism!

Use optimized models in production!

Aurorai, llc
