Transformers are marching across the planet! In this article we will recall (or learn) how visual attention works, understand what is wrong with it, and, most importantly, see how to fix it so as to end up with the ICCV 2021 best paper.
CV-transformers in a nutshell
Attention Is All You Need
Let’s start from afar, namely from 2017, when A. Vaswani et al. published the famous paper “Attention Is All You Need”, proposing the Transformer architecture for seq2seq problems, machine translation in particular. I will not dwell on how significant this event was for NLP as a whole; suffice it to say that today almost every ML solution that works with text reaps the benefits of that success, either using a Transformer-based architecture directly, working with embeddings from BERT, or in some other way. The key, and conceptually almost the only, component of the Transformer is the Multi-Head Attention layer. Applied to machine translation, it made it possible to model interactions between words located arbitrarily far apart in the text, which set the Transformer apart from other translation models and won it a place in the sun. Formally, this layer is written as the following transformations:
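These are the standard formulas from the Vaswani et al. paper, with $d_k$ the key dimension and $W_i^Q, W_i^K, W_i^V, W^O$ trainable projection matrices:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
```

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)
```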
Since 2017, countless modifications of the Transformer have been proposed (Linformer, Reformer, Performer, etc.), making it more computationally efficient, stabilizing training, and so on. Such a boom could not fail to affect areas of deep learning beyond NLP, and from around 2020 Transformers began to make their way into CV.
Transforming Computer Vision
At first, the idea of applying a Transformer to images may confuse the reader. After all, text and images are quite different modalities, if only because text is a sequence of words. An image is, in a sense, also a sequence (of pixels) whose order can be defined artificially, for example row by row; unlike with text, however, such an ordering carries no semantic meaning. But do not forget that Multi-Head Attention is in fact an operation not on sequences but on unordered sets of vectors, and text is artificially endowed with a sequential structure via positional encoding, so the argument above does not hold. A valid argument against Visual Transformers might be the lack of the pleasant inductive biases available to convolutional networks: equivariance with respect to shifts and the assumption of spatial locality of features. However, this too is a debatable point, about which you can read more here. In the meantime, while the doubters doubted, research groups got to work and showed the world several Visual Transformers (a good survey), including ViT, on whose example we will see how to reformulate Multi-Head Attention for images.
An Image Is Worth 16×16 Words
The authors of ViT suggested a fairly straightforward architecture:
The original image is cut into 16×16 patches, each patch is flattened into a vector, and all of them are passed through a linear layer. Trainable vectors playing the role of positional embeddings are then added to them, and a separate trainable embedding, a direct analogue of BERT’s CLS token, is appended to the set. And that’s all! Next comes an ordinary Transformer Encoder (N × Multi-Head Attention, if you like), and the image class is predicted by a small perceptron that takes as input whatever ends up in place of the CLS token. Like pretty much any Transformer, the model turned out to be very data-hungry: to achieve near-SOTA results it needs to be pre-trained on huge datasets, such as Google’s closed JFT-300M. Nevertheless, in a certain setup the network managed to beat the SOTA models BiT-L and Noisy Student, which can be considered a success. For details I refer the reader to the original paper; a lot of interesting things can be found in the ablations, and I especially recommend studying the Mean Attention Distance plots, an analogue of the receptive field.
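The ViT input pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name is mine, and the random matrices stand in for trainable parameters.

```python
import numpy as np

def vit_embed(image, patch=16, dim=768, rng=np.random.default_rng(0)):
    """Sketch of the ViT input pipeline: patchify -> linear projection ->
    prepend CLS token -> add positional embeddings.
    All weights here are random stand-ins for trainable parameters."""
    H, W, C = image.shape
    # Cut the image into non-overlapping patch x patch pieces and flatten each.
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    W_proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # linear layer
    tokens = patches @ W_proj                       # (num_patches, dim)
    cls = rng.standard_normal((1, dim)) * 0.02      # trainable CLS token
    tokens = np.concatenate([cls, tokens], axis=0)  # prepend CLS
    pos = rng.standard_normal(tokens.shape) * 0.02  # trainable positional embeddings
    return tokens + pos

tokens = vit_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (197, 768): 14*14 patches + 1 CLS token
```

For a 224×224 image this yields 196 patch tokens plus the CLS token, exactly the sequence length a ViT-B encoder consumes.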
Everything seems fine, but CV is not limited to classification. There are tasks like Object Detection, where small details often matter, and segmentation tasks, which generally require pixel-level prediction. All of this demands, at the very least, the ability to work with high-resolution images, that is, a significant increase in the input size. And as you can see, Attention runs in time quadratic in the input size, which for 1920×1920 images is a most acute problem: the forward-pass time shoots through the roof. In addition, small details can already be lost at the first layer, which is essentially a 16×16 convolution with stride 16. Who is to blame and what is to be done? The answer to the first question is more or less clear: the Transformer architecture was adapted to CV too literally. The rest of this article is devoted to answering the second question.
ViT problems were identified in the previous paragraph, so we will not beat around the bush and immediately proceed to consider the architecture proposed in the article Swin Transformer: Hierarchical Vision Transformer using Shifted Windows:
The first layer is qualitatively the same as in ViT: the original image is cut into patches and projected by a linear layer. The only difference is that in Swin the first-layer patches are 4×4, which allows finer-grained context to be captured. Then come several Patch Merging and Swin Transformer Block layers. Patch Merging concatenates the features of neighboring tokens (in a 2×2 window) and reduces the dimension, producing a higher-level representation. Thus, after each Stage, feature “maps” are formed containing information at different spatial scales, which is exactly what yields a hierarchical representation of the image, useful for downstream segmentation / Object Detection / etc.:
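The Patch Merging step is easy to sketch in NumPy. Again a minimal illustration under my own naming, with a random matrix standing in for the trainable reduction layer:

```python
import numpy as np

def patch_merging(x, rng=np.random.default_rng(0)):
    """Sketch of Swin's Patch Merging: concatenate the features of each 2x2
    neighbourhood (giving 4C channels), then reduce to 2C with a linear layer.
    The weight matrix is a random stand-in for a trainable parameter."""
    H, W, C = x.shape
    # Gather the four members of every 2x2 window along the channel axis.
    merged = np.concatenate([x[0::2, 0::2], x[1::2, 0::2],
                             x[0::2, 1::2], x[1::2, 1::2]], axis=-1)  # (H/2, W/2, 4C)
    W_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.02
    return merged @ W_reduce  # (H/2, W/2, 2C): half the resolution, twice the channels

out = patch_merging(np.zeros((56, 56, 96)))  # a stage-1-sized feature map
print(out.shape)  # (28, 28, 192)
```

Each application halves the spatial resolution and doubles the channels, which is what produces the pyramid of feature maps shown in the figure.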
This makes the Swin Transformer a versatile backbone for a variety of CV tasks.
Swin Transformer Block is a key highlight of the entire architecture:
As can be seen from the diagram, two consecutive blocks are two classic Transformer blocks with MLPs, LayerNorms, and pre-activation residuals; Attention, however, has been replaced with something trickier, which we now turn to.
(Shifted) Window Multi-Head Attention
As mentioned, the problem with Multi-Head Attention is its quadratic complexity, which shoots you painfully in the foot when applied to high-resolution images. A rather simple solution comes to mind, presented in the Longformer paper: let each token attend not to all other tokens, but only to those within a fixed-size window (Window Multi-Head Attention). If the token dimension is C and the window size is M×M, then the complexities of (Window) Multi-Head Self-Attention are as follows:
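For an h×w feature map, the Swin paper gives these complexities (softmax omitted):

```latex
\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C
```

```latex
\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2\,hwC
```

The first term in each is the cost of the QKV and output projections; only the attention term changes, dropping from quadratic in $hw$ to linear once tokens attend within M×M windows.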
That is, Attention now works in time linear in hw! However, this approach reduces the overall representational power of the network, since tokens from different windows will not interact in any way. To remedy the situation, the authors did a curious thing. After each block with Window Multi-Head Attention, they put a similar layer with diagonally shifted Attention windows:
This brought back the interaction between tokens while leaving linear computational complexity.
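W-MSA itself, without the shift, can be sketched as follows. This is a single-head NumPy illustration under my own naming; the projection matrices are random stand-ins for trainable weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(x, M=7, rng=np.random.default_rng(0)):
    """Sketch of W-MSA (single head, no shift): split the feature map into
    non-overlapping MxM windows and run self-attention inside each window
    independently, so the cost is linear in the number of windows."""
    H, W, C = x.shape
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.02 for _ in range(3))
    # (H/M, W/M, M, M, C) -> (num_windows, M*M, C)
    win = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    win = win.reshape(-1, M * M, C)
    q, k, v = win @ Wq, win @ Wk, win @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))  # (num_windows, M*M, M*M)
    return attn @ v  # tokens attend only within their own window

out = window_attention(np.zeros((56, 56, 96)))
print(out.shape)  # (64, 49, 96): 8*8 windows of 7*7 tokens each
```

Each window computes a 49×49 attention matrix instead of one 3136×3136 matrix over the whole map, which is where the linear complexity comes from.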
As illustrated above, shifting the Attention windows increases their number. This means that a naive implementation of this layer, padding the original feature “map” with zeros, would force us to compute more Attentions (9 instead of 4 in the example) than without the shift. To avoid unnecessary computation, the authors proposed to cyclically shift the image itself before the computation and then compute masked Attention, so as to exclude interactions between non-neighboring tokens. This approach is computationally more efficient than the naive one, since the number of Attentions computed does not grow:
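The cyclic-shift trick amounts to a roll of the feature map. A toy NumPy sketch (one channel, attention mask not shown):

```python
import numpy as np

# Sketch of the cyclic shift: instead of padding and computing 9 windows,
# roll the feature map by half a window so the shifted windows again tile
# the map into the same number of regular MxM blocks. Tokens wrapped around
# by the roll end up next to non-neighbours, so their interactions are
# suppressed with an attention mask (omitted here).
M = 7
s = M // 2
x = np.arange(56 * 56).reshape(56, 56)          # toy one-channel feature map
shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))
# Window partition now proceeds exactly as in the unshifted case:
windows = shifted.reshape(8, M, 8, M).transpose(0, 2, 1, 3).reshape(-1, M * M)
print(windows.shape)  # (64, 49): still 64 windows, no extra Attentions computed
# After attention, the reverse roll restores the original layout:
restored = np.roll(shifted, shift=(s, s), axis=(0, 1))
assert (restored == x).all()
```

The point is that the shifted configuration is handled with exactly the same window partition as the regular one, at the cost of one roll and a mask.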
Swin also uses slightly different positional embeddings. They were replaced with a learnable matrix B, called the relative position bias, which is added to the query–key product under the softmax:
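In the paper's formulation, for a window of $M^2$ tokens with head dimension $d$:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^\top}{\sqrt{d}} + B\right)V,
\qquad Q, K, V \in \mathbb{R}^{M^2 \times d}
```

Since the relative offset along each axis lies in $[-M+1,\, M-1]$, the values of $B \in \mathbb{R}^{M^2 \times M^2}$ are taken from a smaller parameterized matrix $\hat{B} \in \mathbb{R}^{(2M-1)\times(2M-1)}$.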
As it turned out, this approach leads to better quality.
Experiments and results
In total, the authors offered 4 models of different sizes:
For a fair comparison, the parameters were chosen so that in model size and compute Swin-B roughly matches ViT-B / DeiT-B, while Swin-T and Swin-S match ResNet-50 and ResNet-101, respectively.
Two setups were tested on this benchmark: training on ImageNet-1K, and pre-training on ImageNet-22K followed by fine-tuning on ImageNet-1K. The models were compared by top-1 accuracy.
In the first setup, Swins outperformed the other Visual Transformers by more than 1.5%, including ViTs, which lagged behind by as much as 4%. The SOTA EfficientNets and RegNets turned out to be more serious rivals: here one can only speak of a statistically significant improvement in the accuracy/speed trade-off. In the second setup, pre-training on ImageNet-22K gave a ~2% gain in accuracy, and Swin-L reached 87.3% top-1 accuracy. This once again confirms the importance of pre-training, especially for Transformer architectures.
COCO object detection
To evaluate Swin as a backbone for detection, the authors paired it with detection frameworks such as Cascade Mask R-CNN, ATSS, RepPointsV2, and Sparse R-CNN. ResNe(X)t, DeiT, and several SOTA convolutional architectures were taken as comparison backbones.
Across all frameworks, a Swin backbone gave a confident +3.5–4.2% AP relative to the classic ResNet-50. Against ResNe(X)t, Swin also showed ~3% AP gains across its versions Swin-T, Swin-S, and Swin-B. DeiT lost to Swin by somewhat less, about 2% AP, but was much slower because it computes full Multi-Head Attention over the whole image. Finally, relative to a fairly large set of SOTA detectors, Swin-L with HTC showed an improvement of ~2.6 AP.
ADE20K semantic segmentation
For segmentation with Swin, the UperNet framework was chosen; it was compared with several popular segmentation models, as well as with a DeiT-based one. Swin-S outperformed DeiT-S by as much as 5.3 mIoU, and ResNet-101 and ResNeSt by 4.4 and 2.4 mIoU, respectively. Meanwhile Swin-L, pre-trained on ImageNet-22K, reached 53.5 mIoU, beating SETR by 3.2 mIoU.
As a result, we have the following: the authors managed to reformulate the Transformer architecture for CV tasks, making it computationally cheaper through the use of local Attention. At the same time, Shifted Window Multi-Head Attention keeps the network’s representational power at a level sufficient to compete with current SOTA models. Thanks to this, it became possible to build an architecture that extracts features from images at several spatial scales, which made it possible to successfully use Swin as a backbone for segmentation and detection, where Transformers had previously lagged behind. This is success!