How neural networks learned to identify dependencies in images

In this article we will talk about the ViT transformer architecture and the background of its creation. Even today it is not entirely clear why this “format” of neural network is so effective. Some credit the attention mechanism, while some practitioners in the Computer Vision field are betting more on MetaFormer: https://github.com/sail-sg/poolformer

Neural networks remain a “shadow” process for us, a black box. And the study of Deep Learning increasingly resembles not mathematics but biology, where we observe the behavior of our brainchild.

Historical insight into a mathematical experiment

Almost a month has passed since we posted a long piece on the transformer architecture. It changed the game completely, pushing recurrent networks out of NLP and data mining. Transformers are not just an architecture for working with language and “translating” words into code, but a system built around the “attention” mechanism.

We will refresh your memory, or save you 10 minutes of reading our other articles, where we talked in detail about the technology in its original form from the paper “Attention Is All You Need”.

What advantage do transformers have over recurrent neural networks? Access to the entire sequence while processing a single piece of information. By information and data here we simply mean words. An RNN operated on the principle of gradually accumulating data as the network ran over the sequence, producing a kind of “memory”. This memory was implemented in the hidden states passed from step to step.

h – hidden states
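
To make this concrete, here is a minimal sketch of that “memory”, assuming a vanilla (Elman-style) RNN in PyTorch; all shapes and sizes are purely illustrative:

```python
import torch
import torch.nn as nn

# A vanilla (Elman) RNN: the hidden state h is the network's "memory".
# Everything the model remembers about the sequence so far is squeezed into h.
rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)

x = torch.randn(8, 20, 32)          # batch of 8 sequences, 20 tokens, 32 features each
h0 = torch.zeros(1, 8, 64)          # initial hidden state

outputs, h_last = rnn(x, h0)        # h_last: whatever "survived" all 20 steps
print(outputs.shape, h_last.shape)  # torch.Size([8, 20, 64]) torch.Size([1, 8, 64])
```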

But the accumulation process looked like throwing ingredients into a salad. At first we get intelligible data that the network consistently uses during training, but at some point the “long-term” dependencies are lost. Instead of a winter salad we end up with a mishmash in which only the cucumbers and peas added at the last moment are still visible. We are exaggerating, of course. Translated into technical language: long-term dependencies are lost, and the effect of a vanishing or “exploding” gradient appears.

Vanishing gradients occur when the gradients passed back through the RNN time steps during backpropagation become too small. A limited range of activation function values (for example, the sigmoid) and inappropriate weight initialization also contribute to this problem. When the gradient fades away completely, the model stops learning. Exploding gradients are the opposite: the gradient grows to extreme values.

To combat this, you can use “gradient clipping” to keep gradient values from flying off into space, choose more appropriate activation functions (for example, ReLU instead of sigmoid or hyperbolic tangent), initialize weights with the scale of the data in mind, and use long-term-memory architectures such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), which are specifically designed to tackle the vanishing gradient problem by introducing gating mechanisms that control the flow of information.
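
As a rough sketch of the first remedy, this is roughly what gradient clipping looks like inside a training step, assuming PyTorch; the model, data, and loss here are placeholders for illustration only:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)  # placeholder recurrent model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.randn(8, 100, 32)        # dummy batch of long sequences
target = torch.randn(8, 100, 64)   # dummy targets

optimizer.zero_grad()
output, _ = model(x)
loss = criterion(output, target)
loss.backward()

# Clip the global gradient norm so it cannot "fly off into space"
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```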

But GRU and LSTM have many parameters, which makes the computational effort much larger than that of a plain, imperfect RNN. And we remind you that recurrent neural networks cannot be computed in parallel, unlike transformers.

These architectures require hyperparameter tuning. By the way, we recently wrote about the Optuna library and ways to optimize the “global settings” of AI.

On the other hand, recurrent networks were among the top architectures for processing sequential data, including sound.

Thanks to “attention”, a transformer could look at exactly the right words and form dependencies between them, building sentences more accurately and working with the whole sequence. While a recurrent network, accumulating more and more hidden states, gradually “forgot” the earliest results and thus blocked the path to long data sets, the transformer could access all of the results during training.

And this opened the way to parallel calculations and the formation of pseudo-semantic dependencies between words.

The transformer itself consists of an encoder and a decoder. Each of these components contains several layers that perform data processing. Each decoder and encoder layer typically consists of two main sublayers: an attention mechanism and a fully connected neural network (Feed Forward Network).

Attention begins by calculating weights for each element of the input data. To do this, the attention mechanism uses a similarity function (or attention function) that evaluates how similar the current state of the model is to each element of the input.

Input data is fed to the transformer as a sequence of tokens, or elements. Each token first passes through an input embedding matrix, which converts it into a vector representation so the network can work with the data in vector form. Positional embeddings are added to the input embeddings to convey information about the position of each element in the sequence. This is necessary because the transformer has no built-in notion of the order of elements in a sequence.
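
A minimal sketch of this step, assuming PyTorch and learned positional embeddings (the original paper used sinusoidal ones); the vocabulary and dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 10000, 512, 256

token_emb = nn.Embedding(vocab_size, d_model)   # maps token ids to vectors
pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings

tokens = torch.randint(0, vocab_size, (2, 16))  # batch of 2 sequences, 16 tokens each
positions = torch.arange(16).unsqueeze(0)       # positions 0..15, broadcast over the batch

x = token_emb(tokens) + pos_emb(positions)      # order information is injected by simple addition
print(x.shape)                                  # torch.Size([2, 16, 512])
```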

One of the most common methods for calculating weights is the dot-product attention mechanism.

For each element of the input data, the dot product of its vector representation (its embedding) with the current vector of hidden states of the model is calculated. The resulting values are then normalized with the softmax function to obtain weights that sum to 1 and determine the significance of each element.

After the weights are calculated, a weighted sum of all input elements is taken, each element contributing according to its importance.
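
Put together, a hedged sketch of dot-product attention, assuming PyTorch and illustrative tensor shapes, could look like this:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, key, value):
    # Similarity between each query and every element of the input (scaled dot product)
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)   # weights sum to 1 for each query
    return weights @ value, weights       # weighted sum of the values

q = torch.randn(2, 16, 64)   # (batch, tokens, dim)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
out, attn = dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 16, 16])
```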

The attention mechanism lets the transformer focus on different parts of the input and pick out the aspects that matter for a specific task. It makes it possible to model long-term dependencies between sequence elements, solving the problems of RNNs. A fully connected network (the Feed Forward sublayer) usually follows the attention mechanism and performs additional processing, which lets the transformer extract more complex and abstract dependencies in the data.
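
Here is a rough sketch of how the two sublayers are typically stacked, assuming PyTorch and the residual connections and layer normalization used in the original paper; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: attention sublayer + feed-forward sublayer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention over the whole sequence
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ff(x))     # fully connected (Feed Forward) sublayer
        return x

block = EncoderBlock()
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```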

But a defining feature of the transformer remains its “excessive” vectorization of data, which is why the introduction of the architecture is sometimes called a kind of mathematical experiment that ultimately succeeded: squeeze as much as possible out of the data.

The ability to form complex dependencies between data elements quickly became associated, in the eyes of researchers, with images. This is how ViT was born.

Mutiny on the ship, or how CNNs lost their leadership

How do CNNs work?

These networks apply convolutional filters to the input image to build a feature map, which activates in areas of the image where certain patterns and structures are detected. Each convolutional layer in a CNN has several such filters, each specializing in different features such as edges, textures, or shapes.

Convolution is followed by the application of an activation function such as ReLU to introduce nonlinearity into the model and activate certain features. Pooling layers are then applied to reduce the spatial dimensions of the feature maps and improve scale and translation invariance. This helps the network focus on the most important features of the image and reduces the number of model parameters.

After several layers of convolution and pooling, the feature maps are flattened and fed into fully connected regressor/classifier layers, which compute the final predictions.
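
A minimal sketch of such a pipeline, assuming PyTorch, a toy 224×224 RGB input and 10 output classes; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

# A minimal CNN: convolution -> ReLU -> pooling, repeated, then a classifier head.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 filters, each looking for a local pattern
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 224 -> 112, keeps the strongest activations
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 112 -> 56
    nn.Flatten(),                                # flatten the feature maps for the classifier
    nn.Linear(32 * 56 * 56, 10),                 # fully connected 10-class classifier head
)

print(cnn(torch.randn(1, 3, 224, 224)).shape)    # torch.Size([1, 10])
```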

CNNs work by analyzing an image piece by piece. They start with small pieces of the image and gradually learn to recognize different shapes and patterns.

Why is this not effective enough?

In a CNN, context is only analyzed locally, within a fixed window, which may result in the loss of global information about the image.

Yes, locality was presented as an advantage, but such constraints led to scaling problems.

The Vision Transformer (ViT) architecture was presented in the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al., 2020). Dosovitskiy, not Dostoevsky. And yes, 16×16 refers to the pixel size of a single patch, since applying attention to individual pixels would be far too expensive computationally. The idea behind ViT came from the desire to adapt the transformer architecture, so successful in NLP, to images.

Example of ViT applied to COVID-19 assessment

The way ViT works is to represent an image as a sequence of patches (or “words”), and then process this sequence using transformer mechanisms.

Each image patch first goes through a learned embedding that converts it into a vector representation, just like in the classic transformer. The patches, together with positional embeddings that encode the position of each patch in the original image, are then fed into the transformer.
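
A hedged sketch of the patching step, assuming PyTorch; using a convolution whose kernel and stride equal the patch size is just one common way to implement it, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

img_size, patch_size, d_model = 224, 16, 768
n_patches = (img_size // patch_size) ** 2        # 14 * 14 = 196 patches

# A Conv2d with kernel = stride = patch size cuts the image into non-overlapping
# 16x16 patches and projects each one to a d_model-dimensional vector in one go.
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
pos_embed = nn.Parameter(torch.zeros(1, n_patches, d_model))  # learned positional embeddings

img = torch.randn(1, 3, img_size, img_size)
patches = patch_embed(img)                       # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)      # (1, 196, 768): a "sentence" of patches
tokens = tokens + pos_embed                      # inject position, just like in the NLP transformer
print(tokens.shape)                              # torch.Size([1, 196, 768])
```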

The transformer consists of several layers, each of which uses the same attention mechanism to identify relationships between patches and aggregate information.

After each image patch is converted to a vector representation and passed through multiple transformer layers, the patch representations are aggregated to produce a final representation of the entire image.

Aggregation can be performed in various ways, depending on the specific task and model architecture. One of the most common aggregation methods is global pooling (e.g., average pooling or maximum pooling) across all image patches.

This means that for each vector (feature) dimension of the patch representations, the average (or maximum) value is taken across all patches. This produces a single fixed-length vector that serves as an aggregated representation of the entire image.
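
A minimal sketch of this kind of aggregation, assuming PyTorch and illustrative dimensions (the original ViT actually uses a special classification token; mean pooling is the alternative described here):

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 196, 768)      # patch representations after the transformer layers

pooled = tokens.mean(dim=1)            # average pooling over all patches -> (1, 768)
# pooled = tokens.max(dim=1).values    # max pooling would be the other common option

classifier = nn.Linear(768, 1000)      # e.g. 1000 classes for illustration
logits = classifier(pooled)
print(logits.shape)                    # torch.Size([1, 1000])
```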

Another aggregation method would be to use additional layers within the transformer that aggregate information from different parts of the image. For example, after the transformer layers, additional attention layers or convolutional layers can be added to process the patch representations and produce the final aggregated representation.

The aggregated representation is then fed to a classifier, which produces a prediction for a specific task, such as image classification.

Even with the advantage of parallel computation, problems remained, and many extensions have since been invented to overcome them and improve performance. For example, the classic ViT requires a lot of data and often overfits.

Here are some of the modifications and improvements to the ViT architecture that have been proposed:

DeiT (Data-efficient Image Transformer). DeiT uses knowledge distillation for data-efficient training: knowledge from complex models is transferred to simpler ones. The architecture uses pre-trained models, such as large CNNs, as teachers for the transformer.

Since the main problem of transformers in their original formulation is their susceptibility to overfitting, distillation turned out to be an excellent way to strengthen the network.
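
As a rough sketch of the idea (a simplified soft-label distillation loss, not the exact DeiT recipe, which also adds a special distillation token), assuming PyTorch and illustrative tensors:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Ordinary supervised loss against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Pull the student toward the teacher's softened probability distribution (KL divergence)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard_loss + (1 - alpha) * soft_loss

student_logits = torch.randn(8, 1000)   # transformer student outputs
teacher_logits = torch.randn(8, 1000)   # e.g. outputs of a large pre-trained CNN teacher
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```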

ViT-Lite. It works on the principle of reducing the size and complexity of the transformer model for use, for example, on mobile phones. The number of parameters is reduced, the architecture is simplified, and quantization is applied to the embeddings/vector representations of the patches.

T2T-ViT. Transforms an input image into a sequence of tokens and applies transformer operations to learn the internal dependencies between tokens. By processing each token separately, as well as the interactions between them, the model can capture both global and local dependencies in the image.

VoVNet-ViT. This approach combines the transformer with the VoVNet architecture, which specializes in extracting image features using convolutional layers. The combination made it possible to optimize the network by reducing the length of the token sequence.

In any case, there are many modifications. All of them tackle overfitting or the size of the network, or tailor the model to specific software. So ViT has its drawbacks too, for example in comparison with the more optimized but less accurate YOLO.

It seems as if “attention” is the most important part of the transformer. Well, of course, it is all anyone talks about in lectures. But this is not quite the case: some researchers dispute that this mechanism is the source of the effectiveness. So we have yet to understand the root of why this type of architecture works so well.

We will continue to explain the latest ML research architectures in simple language.
