Recurrent networks versus transformers

What is the problem with recurrent neurons?

Machine translation, language modeling and speech recognition – once upon a time these problems were all solved with RNNs, so-called recurrent neural networks, which compress a huge number of parameters into a final result and a conditional prediction.

The difference between an RNN and a regular neuron with hidden layers and input/output is the presence of a temporal component and memory.

Visualization of the operation of a fully connected or simple neural network


The operating principle of recurrent neural networks is based on the idea of feedback: the output of one step of the network is used as part of the input to the next step. Information from previous steps in the sequence is stored and passed on to later steps for analysis and prediction. This is why RNNs long dominated language-related problems.

This is where the name comes from: recurrence. At each time step t we move from the input data to a new hidden state, and so we gradually accumulate information. This is how context is taken into account. While reading a book, we remember details from previous pages, and by the end we can piece together the whole detective story we have been reading…

The creators of RNNs simply copied the way we read – sequentially. Word by word, state by state, piece of data by piece of data. When processing a token or word, the network remembers the "information" from that token and passes it on for use when processing the next word/token. Each layer of the network has a kind of "memory".

In addition to connections between layers, each element also receives a connection with itself, returning to itself and transmitting information from the current moment in time t1 to the next moment in time t2.

Let’s take the sentence “I love dogs” as an example. First, the recurrent neuron processes “I”, which can be represented as a vector, remembers some of that data, and uses it in the subsequent, sequential processing of “love”.

The recurrent network formulas are a little more complicated:

The trainable input matrix is multiplied by the input data at time step t. This is summed with the product of the trainable recurrent matrix and the hidden vector from the previous step (t-1), and the whole thing is, of course, run through an activation function:

h_t = f(W_x · x_t + W_h · h_(t-1) + b)

The new hidden state is then multiplied by another trainable matrix, and our output y_t is ready:

y_t = W_y · h_t

The hidden state vector h_t is the “memory” of the layer.
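Here is a minimal NumPy sketch of that update rule; the matrix names, sizes and toy embeddings are purely illustrative, not taken from the article:

```python
import numpy as np

np.random.seed(0)
hidden_size, embed_size, vocab_size = 8, 4, 3

# Trainable matrices (random here; learned during training in a real network)
W_x = np.random.randn(hidden_size, embed_size) * 0.1   # input -> hidden
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden (the recurrence)
W_y = np.random.randn(vocab_size, hidden_size) * 0.1   # hidden -> output
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One recurrent step: h_t = f(W_x x_t + W_h h_(t-1) + b), y_t = W_y h_t."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    y_t = W_y @ h_t
    return h_t, y_t

# Toy embeddings for the sentence "I love dogs"
sentence = {"I": np.random.randn(embed_size),
            "love": np.random.randn(embed_size),
            "dogs": np.random.randn(embed_size)}

h = np.zeros(hidden_size)   # the "memory" starts out empty
for word, x in sentence.items():
    h, y = rnn_step(x, h)
    print(f"{word:>5} -> hidden state norm {np.linalg.norm(h):.3f}")
```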

We obtain a new hidden state by seasoning the processed old one with additional input data. It is easy to see that, because of this principle of gradual data accumulation, such recurrent neural networks take longer to run. Imagine a neural network that has to crawl through thousands of words and sentences…

The memory of such a recurrent network is quite indiscriminate, so at some point it may simply forget the “information” from the very first words it was fed… And when processing long sequences, RNN developers may run into the problem of vanishing or exploding gradients.
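A tiny illustration of why this happens (the matrix and the numbers below are made up for the example): backpropagation through time multiplies the gradient by the recurrent matrix once per step, so if that matrix tends to shrink vectors, the signal from the earliest words decays exponentially.

```python
import numpy as np

np.random.seed(1)
W_h = np.random.randn(8, 8) * 0.3   # a recurrent matrix that contracts vectors
grad = np.ones(8)                    # gradient arriving from the loss

for steps in [1, 10, 50]:
    g = grad.copy()
    for _ in range(steps):
        g = W_h.T @ g                # one step of backprop through time (nonlinearity omitted for simplicity)
    print(f"{steps:>2} steps back in time: gradient norm = {np.linalg.norm(g):.2e}")

# With a "large" W_h the same loop would blow up instead - the exploding gradient.
```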

So how do we take into account all of those earlier hidden states, and not just the last one? That is why researchers developed the “Attention” mechanism, which in 2017 became the heart of the transformer.

Attention mechanism: RNN + attention

At the right moments, we pay attention to the right words. Tokens are not mapped one to one: the order of words obviously differs between languages, and some words are used together with others more often. The Attention principle, developed by enthusiastic scientists, is designed to distribute the weights – the significance – of tokens/words and create the effect of “context”.

The principle is as close as possible to simulating semantic chains.

For example, we mention tiles in an architectural, construction or interior context. For any word in a language there is a set of words that are constantly used alongside it. We “shoot”, most likely, with a gun or a pistol… We more often “turn on” a PC or a phone, or “plug in” an appliance.

Mammal-whale-plankton-water…

Chair-table-set-furniture-fork…

We can distribute the “significance” of some words onto others. There are different chairs: on the street, in the kitchen, in a banquet hall, in the office… A chair does not necessarily have to be connected with a furniture set or a fork. But with the table…

That is why it is preferable to distribute the meaning of words contextually in the text.

It is easier to explain the concept of “attention” through the essence of things. We have a car – what makes a car a car in the first place? The wheels, of course. The wheels are more important than all the other elements.

Points of interest are highlighted in green.


Attention in machine learning is a mechanism that allows a model to dynamically choose which parts of the input data to focus its attention on when performing a task.

The Attention mechanism calculates a weight for each word in the input sequence based on its importance in the context of the current query or task. These weights are then used to compute a weighted sum of the representations of all the words, giving a contextualized representation.
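A minimal sketch of that weighted sum; the word vectors and the query here are random stand-ins, invented purely for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

np.random.seed(0)
words = ["I", "love", "dogs"]
reps = np.random.randn(3, 4)          # made-up 4-dimensional representations of the words

query = np.random.randn(4)            # what we are currently "asking about"
scores = reps @ query                 # raw importance of each word for this query
weights = softmax(scores)             # normalized attention weights, they sum to 1
context = weights @ reps              # weighted sum = contextualized representation

for w, a in zip(words, weights):
    print(f"{w:>5}: attention weight {a:.2f}")
print("context vector:", np.round(context, 2))
```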

This weight matrix is learned during training. For each token it computes a vector and assigns weights according to the token’s place in the sentence.

The Attention mechanism builds a matrix of weights – those very semantic chains – in which the importance of some words for others is determined. If we open an ordinary Google translator, we will see a whole list of translations of a single word, ranked by their popularity or importance.

When working with typical encoders/decoders – for example, a variational autoencoder – the results improve significantly once an additional Attention layer is added. “Attention” also reveals the probabilistic approach of the neural network.

What is the probability that the translation of the word “sex” is love, and not a person’s gender? But the full power of this mechanism only unfolds in transformers.

Attention replaces each token/word embedding with an embedding that contains information about neighboring tokens, rather than using the same embedding for each token regardless of context. If we encoded words according to the principle of a dictionary, we would simply get a “bag of words” that are in no way connected with each other.

Transformers, or a paradigm shift in NLP

Now, instead of recurrent neurons, the time of transformers has come. They take context into account selectively and form unique weight matrices for words. Natural language processing no longer means sequentially accumulating heaps of data, but obtaining the selective context of individual words and their usages…

The transformer architecture consists of an encoder and a decoder.

The encoder consists of layers, just like the decoder.

Each layer consists of blocks: self-attention and a fully connected neural network (an ordinary feed-forward network in which every neuron of one layer is connected to every neuron of the next).

The input data passes through the self-attention mechanism and, in its new vector representation, is “fed” to an ordinary fully connected neural network.

Self-attention is the backbone of the transformer’s work. Let’s imagine that we have the sentence “The cat catches the mouse.” Each word in this sentence (cat, catches, mouse) is represented by a vector. Self-Attention is a mechanism that allows the model to focus on the important words in a sentence and determine which of them matter most for understanding the meaning of the whole sentence.

For each word in a sentence, we create three vectors: Query, Key, and Value. We then use these vectors to determine how important each word is to every other word in the sentence. Naturally, this comparison is not a simple element-wise multiplication.
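A compact NumPy sketch of scaled dot-product self-attention; the projection matrices, dimensions and embeddings below are illustrative, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
d_model = d_k = 8
tokens = ["The", "cat", "catches", "the", "mouse"]
X = np.random.randn(len(tokens), d_model)      # token embeddings (made up)

# Trainable projection matrices (random here, learned in a real model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # Query, Key, Value for every token
scores = Q @ K.T / np.sqrt(d_k)                # how strongly each token relates to each other token
weights = softmax(scores, axis=-1)             # attention scores, each row sums to 1
contextual = weights @ V                       # new, context-aware representation of every token

print(np.round(weights, 2))                    # row i shows how token i distributes its attention
```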

The main goal of self-attention is to pull into the neural network as many vector representations as possible that capture the significance of individual words in different contexts.

When we say “The cat catches the mouse”, the model can focus on the word “cat” to understand who we are talking about. To do this, the model calculates how important the word “cat” is to every other word in the sentence. If it is important, the model pays more attention to it.

The word “it” depends most on “animal”: animals in English are referred to by the pronoun “it”.

Thus, thanks to the Self-Attention mechanism, the model can dynamically determine which words in a sentence are most significant, taking into account their context and their relationships with each other. The result of the mechanism’s operation is an attention score.

If we carry out this kind of encoding operation several times in parallel – say, eight times – and get new weight matrices as output, we end up with a whole set of weights for different contexts, and that is even more information! The more information, the better.
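A sketch of that “several heads in parallel” idea – multi-head attention – reusing the single-head computation from above; the head count and dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, d_k, rng):
    """One head with its own (here random, in reality learned) Q/K/V projections."""
    W_q, W_k, W_v = (rng.standard_normal((X.shape[1], d_k)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))                               # 5 tokens, model dimension 64
heads = [attention_head(X, d_k=8, rng=rng) for _ in range(8)]  # 8 heads, 8 dimensions each
multi_head = np.concatenate(heads, axis=-1)                    # concatenated back to dimension 64
print(multi_head.shape)                                        # (5, 64): one context-aware vector per token
```

Each head learns its own weight matrix, i.e. its own kind of “semantic chain”, which is exactly the extra information the paragraph above is talking about.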

ChatGPT is trained on large text datasets, where the model tries to minimize the loss (e.g. cross-entropy) between the generated text and the correct answers. During training, the model adjusts the weights inside the transformer blocks to improve the quality of text generation. And in the end we get a very powerful neural network with billions of parameters that produces perfectly acceptable text for users.
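A toy illustration of that cross-entropy objective (the vocabulary and scores are invented for the example): the model assigns a probability to every token in its vocabulary, and the loss is the negative log-probability it gave to the correct next token.

```python
import numpy as np

vocab = ["I", "love", "dogs", "cats"]
logits = np.array([0.2, 1.5, 2.0, 1.9])        # model's raw scores for the next token
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> probabilities over the vocabulary

target = vocab.index("dogs")                   # the "correct answer" from the training text
loss = -np.log(probs[target])                  # cross-entropy for this single prediction

print(dict(zip(vocab, np.round(probs, 2))), "loss =", round(float(loss), 3))
# Training nudges the transformer's weights so that this loss falls across billions of such predictions.
```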

What transformers are there today besides GPT?

BERT (Bidirectional Encoder Representations from Transformers): Developed at Google, BERT is a transformer model trained on a huge corpus of text data to perform various NLP tasks such as text classification, information extraction, and question-answering systems.

T5 (Text-to-Text Transfer Transformer): Developed by Google, T5 is a versatile transformer model that can solve a wide range of NLP tasks presented in a text-to-text format. This format allows a single trained transformer to be used for various tasks, such as translation, classification, text generation and much more.

XLNet. This approach, also developed by Google, is an extension and improvement of the BERT transformer model. XLNet uses a permutation language modeling mechanism and offers improved context modeling and better performance on a variety of NLP tasks.

RoBERTa (Robustly optimized BERT approach). Developed by Facebook, RoBERTa is an improved version of the BERT model that has been trained using various learning strategies such as dynamic masking and long-sequence training, resulting in improved performance on a variety of NLP tasks.
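For a sense of how easy models like BERT are to try in practice, here is a hedged sketch using the Hugging Face transformers library (assuming it is installed and the pretrained weights can be downloaded; the sentence is just an example):

```python
# pip install transformers torch  (assumed prerequisites)
from transformers import pipeline

# BERT was pre-trained with masked-language modeling, so we can ask it to fill in a blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat catches the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```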
