A Silent Revolution and a New Wild West in Computer Vision

It might seem that computer vision has already had its revolution. In 2012, algorithms based on convolutional neural networks took off. From 2014 they reached production, and by 2016 they were everywhere. But at the end of 2020 a new round began, and this time it took not four years but one. Let’s talk about Transformers in computer vision. This article gives an overview of the new work that has appeared over the last year. If it is more convenient for you, the article is also available as a video on YouTube.

Transformers are a type of neural network introduced in 2017. Initially, they were used for translation:

But, as it turned out, they work as a universal model of language. And off we went. The well-known GPT-3, in fact, is a descendant of transformers.

What about computer vision?
And here things get interesting. You could not say that transformers are a natural fit for such tasks: they were built for sequences, and one-dimensional ones at that. But they work too well in other tasks to ignore. In this story I will go through the key papers and the interesting aspects of how they are applied. I will try to cover the different ways in which transformers have been squeezed into CV.


It’s 2020. Things took off. Why? It’s hard to say. But I think we should start with DETR (End-to-End Object Detection with Transformers), which was released in May 2020. Here transformers are applied not to the image itself, but to the features extracted by a convolutional network:

There is nothing especially novel in this approach: ReInspect did something similar in 2015, feeding the output of the backbone network into a recurrent neural network. But ReInspect lost to DETR by about as much as a recurrent network is worse than a transformer. With transformers, both accuracy and ease of training improved dramatically.
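To make "applying a transformer to backbone features" a little more concrete, here is a minimal sketch. It flattens a toy H × W × C feature map into a sequence of H·W tokens and adds a fixed sine/cosine code for the row and the column of each cell. The function names, the toy sizes, and the exact encoding formula are my assumptions for illustration; the actual DETR code differs in its details.

```python
import math

def sine_encoding(pos, dim, temperature=10000.0):
    """1D sinusoidal code of length `dim` (must be even) for a position."""
    code = []
    for i in range(0, dim, 2):
        freq = 1.0 / (temperature ** (i / dim))
        code.append(math.sin(pos * freq))
        code.append(math.cos(pos * freq))
    return code

def feature_map_to_tokens(feature_map):
    """Flatten an H x W x C feature map (nested lists) into H*W tokens.

    Each token is the C-dim feature vector plus a positional code whose
    first half encodes the row and whose second half encodes the column.
    Assumes C is divisible by 4 (hypothetical constraint of this sketch).
    """
    channels = len(feature_map[0][0])
    half = channels // 2
    tokens = []
    for y, row in enumerate(feature_map):
        for x, feat in enumerate(row):
            pos = sine_encoding(y, half) + sine_encoding(x, half)
            tokens.append([f + p for f, p in zip(feat, pos)])
    return tokens

# Toy "backbone output": a 2 x 3 spatial grid with 4 channels, all zeros,
# so the resulting tokens show the positional code alone.
fmap = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
tokens = feature_map_to_tokens(fmap)
print(len(tokens), len(tokens[0]))
```

The transformer then sees a plain sequence of tokens, and the positional code is the only thing that tells it where on the grid each token came from.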

Of course, there are a couple of fun things that no one had done before DETR (for example, the way positional encoding, which a transformer needs, is implemented). I described my impressions here.
I can only add that DETR opened the way for using transformers in computer vision. Has it been used in practice? Does it work now? I don’t think so:

  1. Its main problem is the difficulty of training and the long training time. This problem was partially solved by Deformable DETR.

  2. DETR is not universal. There are tasks where other approaches work better, for example IterDet. But in some tasks it (or its derivatives) still holds the lead: https://paperswithcode.com/sota/panoptic-segmentation-on-coco-panoptic

Immediately after DETR came Visual Transformer (article + good overview) for classification. Here, too, the transformer takes the output feature map of a standard backbone:

I would not call Visual Transformer a big step, but it reflects the common thinking of the time: try applying a transformer to features extracted by a backbone.


Let’s go further. The next big step is ViT:

It was published in early December 2020 (implementation). And here everything is done properly: a transformer as it is. The picture is divided into 16 × 16 patches. Each patch is fed into the transformer as a “word”, supplemented by a positional encoding.
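The patch splitting described above can be sketched in a few lines. This is only an illustration under my own assumptions: a grayscale image stored as nested lists, a hypothetical helper named `image_to_patches`, and a plain integer index standing in for the positional information. In the real model each flattened patch is additionally projected by a learned linear layer before it enters the transformer.

```python
def image_to_patches(image, patch=16):
    """Split an H x W image (nested lists) into flattened patch x patch tiles.

    Returns a list of (position_index, flat_pixels) in row-major patch order.
    Assumes H and W are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            tokens.append((len(tokens), flat))
    return tokens

# Toy grayscale 32 x 48 image where each pixel stores its own column index.
image = [[x for x in range(48)] for _ in range(32)]
tokens = image_to_patches(image)
print(len(tokens), len(tokens[0][1]))
```

A 32 × 48 image gives a 2 × 3 grid of patches, so the transformer receives six "words" of 256 values each.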

And, suddenly, it all worked. Except that training took a long time (and the accuracy was not state of the art). And on datasets of fewer than 14 million images it somehow did not perform top-notch.
But all these problems were solved by a counterpart, this time from Facebook: DeiT, which greatly simplified training and inference.

On large datasets, this approach still holds first place in almost all classification benchmarks: https://paperswithcode.com/paper/going-deeper-with-image-transformers

In practice, we tried to use it in one task. But with a dataset of ~2-3 thousand pictures, it did not work very well, and a classic ResNet was much more stable and accurate.


Let’s go further: CLIP. This is a very interesting application of transformers from a completely different angle. In CLIP, the task is reformulated: the goal is not to recognize the image, but to find the closest textual description for it. Here the transformer learns the text embedding, and the convolutional network learns the visual embedding:

Such a model takes a very long time to train, but it turns out to be universal. It does not degrade when the dataset changes. The network is able to recognize things that it has seen only in a completely different form:

Sometimes it works almost too well:

But, although this works well on some datasets, it is not a universal approach:

Here is a comparison with a linear probe on ResNet50 features. But you need to understand that on some of these datasets it works much worse than a model trained on 100 pictures.
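To make CLIP's reformulated task concrete, here is a toy sketch of the matching step only: given one image embedding and several caption embeddings, pick the caption with the highest cosine similarity. The 3-dimensional vectors and the captions are made up for illustration, and `best_caption` is a hypothetical helper, not part of any CLIP library. In the real model the embeddings come from the trained image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def best_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image's."""
    return max(caption_embeddings,
               key=lambda c: cosine(image_embedding, caption_embeddings[c]))

# Made-up embeddings standing in for encoder outputs.
captions = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image = [0.2, 0.8, 0.1]
print(best_caption(image, captions))
```

Because classification is reduced to choosing among text descriptions, the set of classes can be changed just by writing different captions, which is where the universality comes from.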

Out of interest, we tested it on several tasks, for example action and clothing recognition. And everywhere CLIP worked very badly. In general, one could talk about CLIP for a very long time. There is a good article on Habr. And I made a video where I talk about it:

Vision Transformers for Dense Prediction

The next network that is, in my opinion, indicative is “Vision Transformers for Dense Prediction”, which came out a month ago. Here you can switch between ViT / DETR-style setups: you can use convolutions for the first level, or you can use transformers.

In this case, the network is used not for detection / classification, but for segmentation / depth estimation. It gives state-of-the-art results in several categories at once, while running in real time. It’s a real pity that @AlexeyAB (the author of YOLOv4 and one of the authors of the paper) did not post a separate publication about it here. Overall, the network is nice and runs out of the box, but so far I have not tried it anywhere. If anyone is interested, I did a more detailed review here:


At this point, we need to step back a little. Everything above covers the most striking examples of the main approaches to using transformers:

  • Transformers are used to process the output of the convolutional network

  • Transformers are used to find logic on top of the network’s output

  • Transformers are applied directly to the image

  • A hybrid of approaches 1-2

Everything below consists of examples of how the same transformers / approaches are used for other tasks. Let’s go.


Pose3D. A transformer can also be applied to explicit features extracted by an off-the-shelf network, for example, to skeletons:

In this work, the transformer is used to reconstruct a 3D model of a person from a series of frames. At CherryLabs, we did this (and more complex reconstructions) 3 years ago, only without transformers, with embeddings. But, of course, transformers let you do it faster and more stably. The result is quite good and stable 3D, without retraining:

The advantage of transformers in this case is their ability to work with data that has no local correlation, unlike other neural networks (especially convolutional ones). This allows the transformer to learn from complex and varied examples.
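The reason a transformer does not depend on local correlation is that self-attention compares every token with every other token, however far apart they are, whereas a convolution only looks at a small neighborhood. Below is a minimal sketch of single-head scaled dot-product self-attention. It is deliberately simplified: queries, keys, and values are the raw token vectors, while a real transformer first applies learned projections. All names and the toy data are my own.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention without learned weights."""
    dim = len(tokens[0])
    output = []
    for query in tokens:
        # Score this token against every token in the sequence, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # New representation: attention-weighted average of all tokens.
        output.append([sum(w * value[i] for w, value in zip(weights, tokens))
                       for i in range(dim)])
    return output

# Tokens 0 and 3 are similar although they sit at opposite ends.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mixed = self_attention(tokens)
print([round(v, 3) for v in mixed[0]])
```

In this toy run the first token draws most of its new representation from itself and from the last token, even though they are not neighbors. That is the property that helps with skeleton joints, which are related by body structure rather than by position in the input.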

Incidentally, the idea occurred to many people at the same time. Here is one more approach / implementation of the same idea.


If you look at where convolutional networks fall short precisely because the image has an internal logic of its own, pose estimation immediately comes to mind. TransPose is a network that recognizes poses by optimizing convolutions:

Compare with the classical approaches to pose recognition (a fairly old version of OpenPose).

Different papers had up to ten such stages. Now they are replaced by a single transformer. The result, of course, is much better than that of current networks:


Above, we already mentioned one transformer-based segmentation network, from Intel. Swin from Microsoft shows better results, but not in real time. In fact, it is an improved and extended ViT / DeiT, reworked for segmentation:

This costs speed, but the quality is impressive, with leading results in many categories: https://paperswithcode.com/paper/swin-transformer-hierarchical-vision


There are problems in which convolutional networks do not work very well at all, for example matching two images. A year and a half ago, people often used the classic pipeline of SIFT / SURF + RANSAC for this (a good guide on this topic + a video I recorded a year ago). A year ago SuperGlue appeared, the only cool application of a Graph Neural Network I have seen in computer vision. However, SuperGlue solved only the matching problem. And now there is a transformer-based solution, LoFTR, which is almost end-to-end:

I didn’t have time to use it myself, but it looks cool:

Recognizing actions

In general, of course, transformers are good wherever there are sequences or complex logical structures, or where these need to be analyzed. There are already several networks in which actions are analyzed by transformers (Video Transformer Network, ActionBert). They promise to add them to MMAction in the near future.


A year ago I wrote a huge article on Habr about what works in tracking and how to track objects. Many approaches, complex logic. Only a year has passed, and on many benchmarks there is an undisputed leader, STARK:

Of course, it does not solve all cases. And not all cases are won by transformers. But, most likely, that will not last long. For example, eye tracking on transformers. Here is Siamese-network-style tracking from a couple of weeks ago. Here is one for BBOX + feature tracking, and here is another, with almost the same name.


And all of them run at good speed.


Re-identification, as you remember, can be split out of tracking. 20 days ago a transformer for ReID came out, which can boost tracking considerably.

Face recognition with transformers also seems to have arrived, a week ago:


If you look at more specific applications, there are also a lot of interesting things. ViT is already being pushed hard into CT and MRI analysis (1, 2):

And for segmentation (1, 2):


What surprises me is that I don’t see a good OCR implementation on transformers. There are several examples, but in benchmarks they sit somewhere near the bottom:

All the state of the art is still held by classical approaches. But people are trying. Even I tried to bolt something together about 2 years ago, but it gave no result.

More interesting

I never would have thought it, but transformers have already been applied to colorizing pictures. And this is probably the right call:

What’s next

It seems to me that transformers will end up at the top in almost all computer vision categories, and certainly for any video analytics.

Transformers consume their input as a linear sequence and store spatial information in various clever ways. But it seems that sooner or later someone will come up with a more suitable design, perhaps one that combines a transformer with 2D convolution. And people are already trying.

In the meantime, let’s watch how the world changes, literally every day. When enough material accumulates, I usually post a large article on Habr. And I usually write about individual papers / ideas in my channel: https://t.me/CVML_team (mirrored here: https://vk.com/cvml_team ).

And the current article, if it is more convenient for anyone, is available on YouTube:
