In recent years, transformers, originally developed for natural language processing, have become increasingly important in computer vision tasks such as object detection and image segmentation, overtaking traditional architectures built only from convolutional layers.
Among the best-known vision transformers are Google's ViT and Microsoft's Swin Transformer, which dominate object detection and the segmentation of natural images. They are only two examples of the many efforts to adapt transformers to natural image processing.
FAIR recently introduced ConvNeXt, which re-explores the design space and tests the limits of what a pure ConvNet can achieve. Through their experiments, the authors found a way to upgrade a purely convolutional network so that it outperforms transformer-based networks, returning convolutional networks to greatness.
ConvNet Modernization Roadmap
The paper presents a five-part roadmap for modernizing a standard ResNet toward the design of a vision transformer step by step, which we will cover in the following subsections.
In the first part of the roadmap, the authors make two initial changes to the original ResNet-50 architecture, as shown in the table below.
First, they change the number of blocks in each stage. As the table shows, the original ResNet-50 distributes its blocks over 4 stages as (3, 4, 6, 3). The authors adopt the multi-stage design of Swin Transformer, in which each stage outputs feature maps at a different resolution and the stages follow a ratio of 1:1:3:1.
ConvNeXt therefore follows this ratio and redistributes the blocks as (3, 3, 9, 3), which improves model accuracy by 0.6%, to 79.4%.
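As a quick check of the ratio arithmetic, the sketch below reduces both block distributions to their smallest integer ratio. The helper name `stage_ratio` is hypothetical and used only for illustration.

```python
from functools import reduce
from math import gcd


def stage_ratio(depths):
    """Reduce per-stage block counts to their smallest integer ratio."""
    common = reduce(gcd, depths)
    return tuple(d // common for d in depths)


resnet50_depths = (3, 4, 6, 3)   # original ResNet-50
convnext_depths = (3, 3, 9, 3)   # redistributed blocks

print(stage_ratio(resnet50_depths))  # (3, 4, 6, 3)
print(stage_ratio(convnext_depths))  # (1, 1, 3, 1)
```

The redistributed network matches the 1:1:3:1 ratio exactly, with most of the extra depth placed in the third stage.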
The second change redesigns the initial convolution block (the stem). In the original ResNet-50, the stem applies a 7×7 convolution with stride 2 to the input, followed by a max-pooling operation; together they downsample the input image by a factor of 4.
They replace this stem using the same approach as Swin Transformer: a non-overlapping 4×4 convolution with stride 4, which reproduces the patch-splitting ("patchify") operation of ViT. This improves model accuracy by another 0.1%, to 79.5%.
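To see that both stems downsample by the same factor, the following sketch applies the standard output-size formula for convolution and pooling layers. The padding values (3 for the 7×7 convolution, 1 for a 3×3 max pooling with stride 2) and the 224-pixel input are assumptions based on the usual ResNet configuration; they are not stated in the text above.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stem(size):
    # Assumed: 7x7 convolution, stride 2, padding 3,
    # followed by 3x3 max pooling, stride 2, padding 1.
    size = conv_out_size(size, kernel=7, stride=2, padding=3)
    return conv_out_size(size, kernel=3, stride=2, padding=1)


def patchify_stem(size):
    # 4x4 convolution with stride 4 and no padding: patches do not overlap.
    return conv_out_size(size, kernel=4, stride=4)


print(resnet_stem(224))    # 56
print(patchify_stem(224))  # 56
```

Both stems map a 224×224 input to a 56×56 feature map, but the patchify stem does it in a single layer in which every input pixel belongs to exactly one patch.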
In this part of the roadmap, the authors borrow some basic ideas from ResNeXt, which achieves a better trade-off between accuracy and computation than a traditional ResNet by using grouped convolutions while increasing the network width to compensate for the lost capacity.
ConvNeXt uses depthwise convolution, a grouped convolution in which the number of groups equals the number of input channels. Each convolution kernel processes a single channel and mixes information only in the spatial dimension, an effect similar to the weighted sum in the self-attention mechanism. Together with the increased width, this raises model accuracy by 1%, to 80.5%.
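The saving from grouping can be illustrated by counting weights. This is a minimal sketch: the width of 96 channels and the group count of 32 are illustrative assumptions, and biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel, groups=1):
    """Number of weights (no bias) in a 2D convolution with the given grouping."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels / groups input channels.
    return kernel * kernel * (in_channels // groups) * out_channels


channels = 96  # assumed width, for illustration only

dense = conv_weights(channels, channels, kernel=3)                       # ordinary convolution
grouped = conv_weights(channels, channels, kernel=3, groups=32)          # grouped convolution
depthwise = conv_weights(channels, channels, kernel=3, groups=channels)  # one kernel per channel

print(dense, grouped, depthwise)  # 82944 2592 864
```

With one group per channel, the layer no longer mixes channels at all, which is why the 1×1 convolutions around it are needed for channel mixing and why the network can afford to be wider.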
Inverted bottleneck layer
The authors create an inverted bottleneck, similar to the one in MobileNetV2 and to the MLP block of a Transformer, in which the hidden dimension is four times the input dimension.
Although the FLOPs of the depthwise convolution layer increase after this inversion, the FLOPs of the entire network fall because the 1×1 shortcut convolutions in the downsampling residual blocks become much cheaper. Accuracy rises by 0.1%.
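The effect of the inversion on a single block can be sketched by counting multiply-accumulates (MACs) per spatial position. The widths 96 and 384 are illustrative values chosen to match the 4× ratio mentioned above, and the function names are hypothetical.

```python
def pointwise_macs(c_in, c_out):
    """Multiply-accumulates of a 1x1 convolution at one spatial position."""
    return c_in * c_out


def depthwise_macs(channels, kernel):
    """Multiply-accumulates of a depthwise convolution at one spatial position."""
    return kernel * kernel * channels


def block_macs(c_outer, c_inner, kernel=3):
    """Cost of 1x1 (outer -> inner), depthwise (inner), 1x1 (inner -> outer)."""
    return {
        "pointwise": pointwise_macs(c_outer, c_inner) + pointwise_macs(c_inner, c_outer),
        "depthwise": depthwise_macs(c_inner, kernel),
    }


bottleneck = block_macs(c_outer=384, c_inner=96)  # wide -> narrow -> wide
inverted = block_macs(c_outer=96, c_inner=384)    # narrow -> wide -> narrow

print(bottleneck)  # {'pointwise': 73728, 'depthwise': 864}
print(inverted)    # {'pointwise': 73728, 'depthwise': 3456}
```

Within one block the 1×1 layers cost the same in both layouts and only the depthwise layer becomes four times more expensive, so the network-level saving has to come from outside the block body, as the text describes.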
Increasing the convolution kernel size
In this part, the authors study how the convolution kernel size affects model performance. After inverting the bottleneck, however, the depthwise convolution operates on the expanded dimension, so enlarging its kernel would significantly increase the number of parameters. They therefore first move the depthwise convolution layer up to the start of the block ((b) in the figure above), which temporarily reduces model accuracy to 79.9%.
They then tried kernel sizes from 3×3 to 11×11 and achieved the best accuracy, 80.6%, with a 7×7 kernel.
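The reason for moving the depthwise layer before enlarging its kernel can be sketched with a small sweep over kernel sizes. The width of 96, the expansion factor of 4, and the intermediate odd kernel sizes are assumptions for illustration; `kernel_sweep` is a hypothetical helper.

```python
def depthwise_weights(channels, kernel):
    """Number of weights (no bias) in a depthwise convolution."""
    return kernel * kernel * channels


def kernel_sweep(dim=96, expansion=4, kernels=(3, 5, 7, 9, 11)):
    """For each kernel size, return (kernel, weights if the layer is moved up,
    weights if it stays inside the expanded part of the block)."""
    return [
        (k, depthwise_weights(dim, k), depthwise_weights(dim * expansion, k))
        for k in kernels
    ]


for kernel, moved_up, expanded in kernel_sweep():
    print(f"{kernel}x{kernel}: moved up = {moved_up}, expanded = {expanded}")
```

Under these assumptions, a 7×7 kernel on the narrow dimension needs 4704 weights, close to the 3456 weights of a 3×3 kernel on the expanded dimension, whereas a 7×7 kernel left in the expanded part would need 18816.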
Finally, the authors consider the effect of activation and normalization functions on model performance.
They first replaced the traditional ReLU activation function with a smoother variant, GELU, which did not improve the accuracy of the model. They then followed the Transformer block and used an activation function only inside the MLP block, so that a single GELU is applied between the two 1×1 layers. This increased model accuracy by 0.7%, to 81.3%.
The number of normalization layers was also reduced, and the remaining Batch Normalization layer was replaced with Layer Normalization. Each of these two changes added 0.1% accuracy, bringing the model to 81.5%.
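For reference, the two activation functions can be compared directly. This sketch uses the exact definition of GELU, x·Φ(x), where Φ is the standard normal cumulative distribution function; it only illustrates the shape of the functions, not how they are placed in the network.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs."""
    return max(0.0, x)


def gelu(x):
    """Gaussian error linear unit: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}  relu = {relu(x):.4f}  gelu = {gelu(x):+.4f}")
```

Unlike ReLU, GELU is smooth around zero and lets small negative values through, while for large positive or negative inputs the two functions nearly coincide.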
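The difference between the two normalization schemes is the axis over which statistics are computed. The sketch below works on a small batch of per-sample channel vectors; it omits the learnable scale and shift and uses batch statistics directly, which are simplifying assumptions.

```python
from statistics import fmean


def normalize(values, eps=1e-6):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = fmean(values)
    var = fmean([(v - mean) ** 2 for v in values])
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def layer_norm(batch):
    """Normalize each sample over its own channels."""
    return [normalize(sample) for sample in batch]


def batch_norm(batch):
    """Normalize each channel over all samples in the batch."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]


batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # 2 samples, 3 channels

print(layer_norm(batch))  # each row has zero mean
print(batch_norm(batch))  # each column has zero mean
```

In this sketch the layer-normalized output for a sample does not depend on which other samples share its batch, whereas the batch-normalized output does.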
The structure of one ConvNeXt block is shown below.
ConvNeXt is roughly comparable to Swin Transformer at all scales in number of parameters, throughput, and memory usage. Its advantage is that it needs no specialized modules such as shifted-window attention or relative position biases.
The authors evaluated ConvNeXt on a variety of vision tasks, including ImageNet classification, object detection and segmentation on COCO, and semantic segmentation, achieving performance comparable to or better than Swin Transformer.
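The layer sequence described in the preceding sections can also be written down as a small table with parameter counts. This is a sketch of the design choices covered in this article only, not the published implementation: the width of 96 is an illustrative assumption, every convolution is assumed to have a bias, and any further details of the real model are omitted.

```python
def convnext_block_layers(dim=96, expansion=4, kernel=7):
    """List (layer description, parameter count) for one block, in order.

    The block output is added to the block input (residual connection),
    which adds no parameters.
    """
    hidden = dim * expansion
    return [
        (f"depthwise conv {kernel}x{kernel}, {dim} channels", kernel * kernel * dim + dim),
        ("layer normalization", 2 * dim),  # scale and shift per channel
        (f"1x1 conv, {dim} -> {hidden}", dim * hidden + hidden),
        ("GELU", 0),                       # the only activation in the block
        (f"1x1 conv, {hidden} -> {dim}", hidden * dim + dim),
    ]


layers = convnext_block_layers()
for name, params in layers:
    print(f"{name:40s} {params:6d}")
print("total parameters:", sum(params for _, params in layers))  # 79200
```

Almost all parameters sit in the two 1×1 layers, while the large 7×7 depthwise kernel contributes only a small share.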
This work encourages scientists to reconsider the importance of convolution in computer vision.