Natural Language Processing: 2019 Results and Trends for 2020
BERTs, BERTs are everywhere
Let’s take things in order. Unless you have spent the last year and a half in the remote Siberian taiga or on vacation in Goa, you have probably heard the word BERT. Since it appeared at the very end of 2018, this model has become so popular that the following picture fits perfectly:
BERTs really have taken over everything in NLP that could be taken over. They are used for classification, named entity recognition, and even machine translation. Simply put, there is no getting around them, so I still have to explain what they are.
The picture compares the hero of the occasion (left) with two other models that also made a splash. On the right is BERT's immediate predecessor, the ELMo model.
It is the authors of this model that we have to thank for the overwhelming Sesame Street tsunami in the field. This children’s TV show was not that popular in Russia and the CIS, so you may not know that Elmo and Bert are the names of its characters. Once everyone saw that this was allowed, the creativity could not be stopped, and people started going through all the names from the show. I’m afraid that when those run out, they will switch to Pokemon.
The ELMo model from Allen AI is a kind of heir to everything the field developed in previous years: a bidirectional recurrent neural network plus several new tricks. Colleagues from OpenAI decided it could be done better, and that all it takes is to apply the Transformer architecture, presented by Google a year earlier, to this task. I believe that over the past 2.5 years everyone has had time to get acquainted with this architecture, so I will not dwell on it in detail. Those who want an introduction can turn to my review from 2017. The OpenAI team called their model GPT, and its successor, GPT-2, later earned them a good deal of hype. But let’s leave that on their conscience and get back to the subject at hand, that is, the models.
One of the most important ELMo tricks was pre-training on a large unlabeled corpus. It worked very well, and colleagues from Google decided it could be done even better. In addition to using the Transformer architecture (which OpenAI's model already had), BERT, which stands for Bidirectional Encoder Representations from Transformers, brought several other important things. The most important of them was the way it is trained on a large corpus.
The picture shows how unlabeled data is turned into training examples. Two methods are shown at once. In the first, a sequence of tokens (words) is taken, for example a sentence, and an arbitrary token in it is masked ([MASK]); during training the model has to guess which token was hidden. In the second, two sentences are taken, either consecutive or from arbitrary places in the text, and the model has to guess whether the second sentence really followed the first ([CLS] and [SEP]).
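To make the two labeling schemes more concrete, here is a minimal sketch in Python. The function names are hypothetical and the code is a simplification, not BERT's actual preprocessing: it masks exactly one token, as described above, whereas the real model masks a fraction of the tokens and works on subword pieces. It also assumes the text has at least three sentences.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def make_masked_example(tokens, rng):
    """Hide one randomly chosen token; the model must predict it."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, position, target

def make_sentence_pair_example(sentences, rng):
    """Build a [CLS] A [SEP] B [SEP] input plus an is-next label.

    Assumes at least three tokenized sentences, so that a sentence
    other than the true successor always exists.
    """
    index = rng.randrange(len(sentences) - 1)
    first = sentences[index]
    if rng.random() < 0.5:
        # Positive example: the sentence that really follows.
        second, is_next = sentences[index + 1], True
    else:
        # Negative example: any sentence except A itself and its successor.
        candidates = [i for i in range(len(sentences))
                      if i not in (index, index + 1)]
        second, is_next = sentences[rng.choice(candidates)], False
    return [CLS] + first + [SEP] + second + [SEP], is_next

rng = random.Random(0)
print(make_masked_example("the cat sat on the mat".split(), rng))
print(make_sentence_pair_example(
    [["it", "rained"], ["we", "stayed", "home"], ["bert", "is", "popular"]], rng))
```

Neither function needs a human annotator: the labels (the hidden token, the is-next flag) come from the text itself, which is why this kind of training scales to very large corpora.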
This training idea proved extremely effective. The answer from our sworn friends at Facebook was the RoBERTa model; the paper about it is titled “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. And on it went. I will not list all the ways of improving the training of a large language model based on the Transformer architecture, simply because it would be boring. I will mention only the work of my colleagues from Hong Kong, ERNIE, in which they enrich training with knowledge graphs.
Before moving on, a few useful links: an article on BERT, and a set of trained BERT and ELMo models for the Russian language.
But enough about BERTs. There are several other important trends. The first is the trend toward smaller models. BERT is very demanding on resources, and many people started thinking about how to keep the quality (or not lose much of it) while reducing the resources the models need. Colleagues at Google came up with a small BERT, and I’m not joking: ALBERT, A Lite BERT. You can see that the small BERT even surpasses its older brother on most tasks while having an order of magnitude fewer parameters.
Another attempt at the same problem came, again, from my colleagues in Hong Kong. They came up with a tiny BERT: TinyBERT. (If at this point you thought that the names are starting to repeat themselves, I am inclined to agree with you.)
The fundamental difference between these two models is that ALBERT uses clever tricks to shrink the original BERT model, such as parameter sharing and reducing the dimension of the internal vector representations through matrix factorization, while TinyBERT takes a fundamentally different approach, knowledge distillation: a small model learns during training to imitate its older sister.
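The distillation idea can be sketched in a few lines. What follows is generic soft-label distillation under my own assumptions (the function names and the temperature value are illustrative), not TinyBERT's full training recipe: the student is penalized for how far its softened output distribution is from the teacher's.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened outputs against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))           # student agrees with teacher
print(distillation_loss([-1.0, 0.5, 2.0], teacher))  # student disagrees: larger loss
```

The loss is smallest when the student reproduces the teacher's distribution exactly, so minimizing it pushes the small model toward the big model's behavior rather than only toward the hard labels.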
In recent years (since about 1990, when the Internet appeared), the number of available corpora has grown. Then came algorithms capable of processing such large corpora (this is what we call the “deep learning revolution”, roughly since 2013). As a result, it came to be seen as normal that getting good quality on a task requires huge corpora of labeled data, text corpora in our case. For example, typical corpora for training machine translation today are measured in millions of sentence pairs. It has long been obvious that for many tasks such corpora cannot be assembled in a reasonable time and for reasonable money. For a long time it was not very clear what to do about it. But last year BERT came on the scene (who would have thought?). This model could be pre-trained on large volumes of unlabeled text, and the resulting model was easy to adapt to a task with a small corpus.
All of the tasks listed in this table have training corpora of several thousand examples, that is, two to three orders of magnitude fewer. And this is another reason why BERT (and its descendants and relatives) became so popular.
Finally, a couple of new trends as I see them. The first is a fundamental change in the attitude toward text. Previously, in most tasks text was treated only as input material, and the output was something useful, for example a class label. Now the community has the chance to remember that text is first of all a means of communication, that is, you can “talk” to the model: ask questions and receive answers as human-readable text. This is what Google's new paper on T5 is about (the name can be read as “transformer, five times”).
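A small sketch of what "everything is text" means in practice. The helper name and the dictionary keys are hypothetical, and the prefix strings follow the style used in the T5 paper but should be treated as illustrative: every task, including classification, becomes a pair of input text and output text.

```python
# Task prefixes in the style of the T5 paper; the exact strings are illustrative.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "acceptability": "cola sentence: ",
}

def to_text_pair(task, source_text, target):
    """Turn any task example into an (input text, output text) pair."""
    return TASK_PREFIXES[task] + source_text, str(target)

# Even a class label is emitted as a word, not as a class index.
print(to_text_pair("acceptability", "The course is jumping well.", "unacceptable"))
print(to_text_pair("translate_en_de", "That is good.", "Das ist gut."))
```

With this framing a single model, a single loss, and a single decoding procedure can serve all tasks; only the text changes.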
Another important trend is that the field is re-learning how to work with long texts. Since the 1970s the community has had ways to work with texts of arbitrary length, TF-IDF for example, but these models have a quality ceiling. The new deep learning models, on the other hand, could not handle long texts (BERT, for instance, limits the input to 512 tokens). Recently, though, at least two works have appeared that approach the long-text problem from different sides. The first, from Ruslan Salakhutdinov's group, is called Transformer-XL.
This work revives the idea that made recurrent networks so popular: you can save the previous state and use it to build the next one, even without propagating the gradient back through time (BPTT).
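Here is a loose, single-layer illustration of that state reuse, written in plain Python under heavy simplifications: there are no learned weights, no causal mask, and no positional encoding, and the function names are my own. A long sequence is processed segment by segment, and each segment attends over its own states plus states cached from earlier segments.

```python
import math

def attend(queries, context):
    """Single-head dot-product attention: each query averages the context vectors."""
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(len(q))
                  for c in context]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * c[d] for w, c in zip(weights, context)) / total
            for d in range(len(q))
        ])
    return outputs

def process_long_sequence(vectors, segment_len, memory_len):
    """Run attention segment by segment, reusing cached states as extra context."""
    memory = []        # cached layer inputs from earlier segments
    all_outputs = []
    for start in range(0, len(vectors), segment_len):
        segment = vectors[start:start + segment_len]
        # The segment attends over memory + itself. In a real framework the
        # memory would be detached here, so no gradient flows back into it.
        all_outputs.extend(attend(segment, memory + segment))
        # Cache this layer's inputs for the next segment; in a deep model
        # these would be the outputs of the layer below.
        memory = (memory + segment)[-memory_len:] if memory_len > 0 else []
    return all_outputs

vectors = [[i / 10, 1.0] for i in range(10)]
with_memory = process_long_sequence(vectors, segment_len=4, memory_len=4)
without_memory = process_long_sequence(vectors, segment_len=4, memory_len=0)
print(with_memory[4], without_memory[4])
```

The first segment is processed identically in both runs, because the memory is still empty. From the second segment on, the outputs differ, since the model now sees context beyond the segment boundary without recomputing anything for the earlier tokens.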
The second work uses Legendre polynomials, which allow recurrent neural networks to process sequences of tens of thousands of tokens.
With this I would like to finish the review of the changes that have taken place and the emerging trends. Let’s see what this year brings; I’m sure there will be a lot of interesting things. A video of my talk on the same topic at Data Tree:
P.S. We will soon have some more interesting announcements, stay tuned!