MTS AI case

How models are trained

Let's briefly talk about how models are trained.

On the left of the picture is classical training, which is now commonly called pre-training. The model simply predicts the next word, over and over, some 500 billion times. After pre-training, you can do something additional with the model.

Next comes alignment. Under the hood, this is what is now commonly called supervised fine-tuning. When I started out, it was simply called machine learning: we just taught models to do something useful. Supervised fine-tuning is essentially the classical training we did before pre-training and everything else was invented.

Finally, there is reinforcement learning from human feedback (RLHF). After that comes further fine-tuning, sometimes for a single specific task.

If we describe RLHF in a nutshell, it works like this. Reinforcement learning is a setup where an agent acts in an environment and receives a reward, and this is exactly the level of abstraction we work at. What does it mean here? The environment is, say, a person who listens to what the model "says"; the model itself is the agent; and there is a reward for what it said.

Training a model this way is incredibly expensive and time-consuming, simply because a person gets tired very quickly, has to be trained for the job, and has to be paid. So we proposed a different option: use another model. Instead of burdening 10 thousand people, we train another model, and with its help we train the main one via reinforcement learning.
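A minimal sketch of that idea, assuming a HuggingFace-style sequence-classification checkpoint is used as the learned "rater" (the checkpoint name and loading code below are illustrative, not our actual setup):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint trained to score (prompt, response) pairs.
reward_tokenizer = AutoTokenizer.from_pretrained("org/reward-model")
reward_model = AutoModelForSequenceClassification.from_pretrained("org/reward-model", num_labels=1)

def reward(prompt: str, response: str) -> float:
    """Scalar reward for one (prompt, response) pair, standing in for a human rater."""
    inputs = reward_tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        score = reward_model(**inputs).logits[0, 0]
    return score.item()

# In the RL loop this score is what the agent (the main model) is optimized for,
# e.g. with PPO, instead of asking thousands of annotators to rate every output.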

As for evaluation, we run it on a dataset called HumanEval. How does it work? Each item contains a text description of a task, and the model has to produce a solution, that is, write code, in a given programming language.
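For illustration, a HumanEval-style item looks roughly like this: the prompt is a function signature with a docstring, the model generates the body, and hidden unit tests decide whether the task is solved. This particular task and its tests are made up, not taken from HumanEval:

# The prompt: a signature plus a docstring describing the task.
PROMPT = '''
def get_initials(full_name):
    """Return the dotted uppercase initials of a full name."""
'''

# A completion the model might generate for this prompt.
COMPLETION = '''
    return ".".join(part[0].upper() for part in full_name.split()) + "."
'''

def check(candidate):
    # Hidden tests: the benchmark counts the task as solved only if these pass.
    assert candidate("ada lovelace") == "A.L."
    assert candidate("grace") == "G."

namespace = {}
exec(PROMPT + COMPLETION, namespace)   # assemble and run prompt + completion
check(namespace["get_initials"])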

Above is a table with the details. The point is that we can train our small 7B model, the last one in the table. It is worth noting that it performs very well even compared to larger models, with the exception of one that is much bigger.

There are other tasks – for example, detecting problems in the code. This is when we try to find an error – not a syntax error, which is very easy to detect with a parser, but something deeper.

Another task is translating between programming languages, for example from C# to Java. If you have ever tried rewriting your code from Python 2 to Python 3, you will understand both the pain and the benefits.

Technical development details

In this section, we will discuss autocompletion in more detail as one of the key functions of the programmer assistant. Since the introduction of GitHub Copilot, a lot has changed compared to the classic feature that everyone is used to.

Autocompletion now works even after a dot and in the middle of a word. It can also generate a high-quality completion of a complex construct spanning several lines, and it does not force you to choose between options: usually the very first suggestion is correct.

So how do we build a solution whose quality is competitive with Copilot?

First, let's explain how a single-line completion differs from a multi-line one. Technically there is almost no difference: in the second case the model just has to be asked to generate for a little longer.
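A minimal sketch of that difference with a HuggingFace-style causal LM (the checkpoint name is a placeholder): the only real change is the token budget and where we cut the output.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("org/code-model-7b")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("org/code-model-7b")

def complete(prefix: str, multiline: bool = False) -> str:
    inputs = tokenizer(prefix, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=128 if multiline else 24,   # "ask the model to work a little longer"
        do_sample=False,
    )
    new_text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Single-line mode: cut everything after the first line break.
    return new_text if multiline else new_text.split("\n", 1)[0]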

But there are big differences in user experience. The point is that the developer has to read what the model generated. At this stage, some problems arise, because short code is much faster to read than long code. Moreover, the dependence is nonlinear.

Google ran a study on its own employees. The situation: autocomplete offers code, the developer accepts it and keeps it in their code. It turned out that 90% of the accepted completions were single-line. So multi-line completions should be offered only when they are high-quality and correct. How do we achieve this? How do we make sure the completion is of exceptional quality? And how do we even decide that a multi-line completion should be offered at all?

Datasets and benchmarks

How do we build them? We take a piece of code and split it: the first part becomes the prompt, the next part becomes the target (the completion), and what lies below the cursor becomes the suffix. In other words, we simulate a situation where a developer needs, for example, to add something after an if statement.
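A sketch of how one such example could be cut out of a source file; the cut points here are random, while the real pipeline targets specific positions such as right after an if statement:

import random

def make_example(source: str, target_len: int = 80) -> dict:
    """Split a file into prompt | target (completion) | suffix around a simulated cursor."""
    # Assumes the file is longer than target_len characters.
    cursor = random.randrange(1, len(source) - target_len)   # where the developer's cursor is
    return {
        "prompt": source[:cursor],                     # everything above and left of the cursor
        "target": source[cursor:cursor + target_len],  # the ground-truth completion
        "suffix": source[cursor + target_len:],        # everything below the cursor
    }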

So we create four types of datasets:

1. From our own repositories. These are internal to us, so we treat them as private data.

2. A programmer assistant is already running inside our company and keeps logs. We extract data from these logs to train and validate on.

3. We took the well-known HumanEval-X and built a dataset for each language using the same method.

4. Data from the most popular and most recent GitHub repositories. Herein lies a problem. There is closed data and there is open data. We cannot send the former to anyone, which makes it hard to compare ourselves with other solutions. The latter, unfortunately, ends up in training sets very quickly, so we cannot be sure of its uniqueness: Copilot or other models may already have been trained on it.

Metrics

We use two metrics:

  • The percentage of matches of the first significant token between the generated text and the target. Say we have a prompt and need to complete a call to the getUserName function: regardless of whether we pass a named or a positional argument, the first significant token must be the same. That is the token we compare.

  • We simply measure perplexity on the first N tokens of the completion. Both metrics are sketched below.
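A sketch of both metrics, assuming the texts are already tokenized and the model has supplied per-token log-probabilities for the target (helper names are illustrative):

import math

def first_significant_token(tokens: list[str]) -> str | None:
    """First token that is not pure whitespace."""
    return next((t for t in tokens if t.strip()), None)

def first_token_match(generated: list[str], target: list[str]) -> bool:
    """Metric 1: does the first significant generated token equal the target's?"""
    return first_significant_token(generated) == first_significant_token(target)

def perplexity_first_n(target_logprobs: list[float], n: int) -> float:
    """Metric 2: perplexity of the target's first n tokens under the model."""
    logprobs = target_logprobs[:n]
    return math.exp(-sum(logprobs) / len(logprobs))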

Possible architectures

Now for an overview of the SOTA architectures that exist today. The first is GPT. The picture shows five open models that perform well at autocompletion; in essence, they are all GPT-like models.

Context

Prompt. For a GPT to generate anything, it needs context, and the classic prompt plays that role. Roughly speaking, it is the text of the file the developer is editing at that moment: from the beginning of the file to the cursor, at the point where autocompletion is requested.

Suffix. The remaining part of the file, from the cursor to the end.

Cross-file context. Snippets from other files currently open in the project are often used to improve quality.

Static code analysis augmentations. We experiment with augmentations obtained through static code analysis.
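A sketch of how these sources could be combined into a single context, assuming cross-file snippets are simply prepended as comments within a character budget (the layout is illustrative; the real prompt format may differ):

def build_context(cross_file_snippets: list[str], prefix: str, max_chars: int = 6000) -> str:
    """Prepend cross-file snippets (as comments) to the local prefix, within a budget."""
    budget = max_chars - len(prefix)
    header = []
    for snippet in cross_file_snippets:
        block = "# --- snippet from another open project file ---\n" + snippet + "\n"
        if len(block) > budget:
            break
        header.append(block)
        budget -= len(block)
    return "".join(header) + prefix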

Infilling

Now let's talk about infilling and why nothing works without it. What is this anyway?

The names vary: fill-in-the-middle (FIM), inserts. In essence, we want to slightly change how a GPT generates: instead of simply predicting the continuation of the text, as usual, it should predict the text in the middle, that is, between the prompt and the suffix we were just talking about.

How do we do it? The prompt fed to the model is modified as follows:

  • most often, the fill-in-the-middle begin token comes first;

  • then the prompt;

  • then the fill-in-the-middle hole token, indicating that the insert should be generated here;

  • then the suffix;

  • then the fill-in-the-middle end token, after which we want the model to start generating the answer.
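Putting the pieces together, the model input might be assembled like this; the literal spellings of the special tokens differ between models (some use <fim_prefix>/<fim_suffix>/<fim_middle>), so the names below are illustrative:

FIM_BEGIN = "<fim_begin>"   # illustrative spellings of the technical tokens
FIM_HOLE = "<fim_hole>"
FIM_END = "<fim_end>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """prefix = text before the cursor, suffix = text after it; the model generates the middle."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

# Example: ask the model to fill in the middle of a condition.
prompt = build_fim_prompt(prefix="if (t", suffix="):\n    handle()\n")
# After FIM_END the model is expected to generate the missing middle, e.g. "oken is None".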

Look at the screenshot above. The tokens are colored the way the tokenizer of one of the popular models sees them. Now suppose we train without infilling and want the model to complete a line like "if (t". During training, the word "token" was always seen as a single, whole token.

But now we want it to complete the text "oken" after the letter t, and that is not what we taught it. In the infilling case (here we have simplified things a little and inserted only one special token, but the meaning is clear), there is a token that marks off the word we want to complete, and the model predicts exactly what we need. With infilling, once the technical tokens are inserted, everything works fine: the model completes any part of a word after any letter. You can stop anywhere and it will generate the rest for you.
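The effect is easy to see with any subword tokenizer (the checkpoint name below is a placeholder; the exact splits depend on the vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/code-model-7b")   # placeholder checkpoint

# "token" is usually one subword, so ordinary next-token training never sees
# a context that ends right after the letter "t" inside that word.
print(tok.tokenize("if (token"))
print(tok.tokenize("if (t"))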

Interaction between model and users

Now let's talk about an interesting feature of the interaction between the model and users.

It all started when we deployed a new model. It ran for about two months without any changes; the only thing we did was monitor the exact match metric. To a first approximation, this is the proportion of suggested autocompletions that the user accepted and kept in their code. At first it sat at around 48%; then we posted an announcement on an internal resource and attracted many new users.

After that, the figure dropped to 43% and then began to climb steadily again. Currently, roughly ⅔ of the autocompletions the system offers to users within our company are accepted by developers. Apparently, developers simply learn to wait for autocompletion exactly in those places where they know it will be correct; in other words, they adapt even faster than the model. Unfortunately, this means we cannot compare different models offline using user logs.
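For reference, the metric itself is simple to compute from the assistant's logs; a sketch, assuming each record notes whether a shown completion was accepted (the field names are illustrative):

def exact_match_rate(log_records: list[dict]) -> float:
    """Approximate exact match: the share of shown completions the developer accepted and kept."""
    shown = [r for r in log_records if r.get("event") == "completion_shown"]
    accepted = [r for r in shown if r.get("accepted")]
    return len(accepted) / len(shown) if shown else 0.0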

Autocomplete evaluation

How can autocompletion be improved? The first option is to re-rank the candidates suggested by the static analyzer. That is, you have classic autocompletion that suggests different methods after the dot, and we rank those suggestions using our GPT model.

Second, we generate only completions that are legal from the point of view of static analysis. Here is how the re-ranker works: say the Jedi library offers several completion options; for each of them we measure, for example, perplexity, and give the user an already sorted list.
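A sketch of the re-ranker, with Jedi supplying the candidates and a placeholder scoring function standing in for the perplexity computed by the GPT model:

import jedi

def rerank_completions(source: str, line: int, column: int, score_with_lm) -> list[str]:
    """Rank static-analysis candidates by a language-model score (lower = more plausible)."""
    candidates = [c.name for c in jedi.Script(code=source).complete(line, column)]
    lines = source.splitlines(keepends=True)
    prefix = "".join(lines[:line - 1]) + lines[line - 1][:column]   # text up to the cursor
    # score_with_lm(prefix, candidate) is a placeholder for, e.g., per-token perplexity.
    return sorted(candidates, key=lambda name: score_with_lm(prefix, name))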

Customer data

Finally, a few words about systems like Copilot. Their main disadvantage is that they are resource-intensive and are server-side solutions: every second, your code has to be sent to a server for analysis.

From a security standpoint, customers now install such systems so that their code stays on their own servers. But if each client organization has its own assistant, why not make the solution more personal? There are repositories that existed before the solution was deployed, and there are logs of user actions. All of this can be fed into a data preparation and training pipeline to build an individual solution for the customer. If the pipeline is well automated, we get a model that, one might say, trains itself.

That's all for today. If you have questions or suggestions, write in the comments and we'll discuss everything!
