How to (Quickly) Make a Russian Local ChatGPT

This story began in early March of this year. ChatGPT was at its peak back then. Sasha Kukushkin, whom we had known for a long time, messaged us on Telegram. He asked whether Sasha Nikolic and I were working on language models for Russian, and how he could help.

Russian Turbo Alpaca

And it so happened that we really were: I was trying to collect a dataset for training a proper base model, rulm, while Sasha was experimenting with existing Russian base models and makeshift instruction datasets.

After that, we kept doing the same thing for a while. I slowly added new datasets to rulm by inertia. Realizing that we would not be able to train a base model any time soon, we decided to focus on instruction fine-tuning and almost started converting what we had into an instruction format similar to Flan. And then I finally got around to carefully re-reading the article.

There I found this picture:

Flan scaling

According to this plot, with the base models then available for Russian we stood no chance. The largest Russian model at the time was rugpt3large with 760 million parameters, which sits at the very beginning of this curve. At the same time, there is another well-known plot from the OpenAI post that got us started on all of this in the first place:

Scaling by InstructGPT

There are no 760-million-parameter models on it at all, and instruction tuning consistently shows good results at every size. The difference lies in the data and in the metric. Flan uses automatically converted datasets for various NLP tasks, while OpenAI used manual annotation. Moreover, the OpenAI plot measures user preferences, while the Flan plot reports real metrics on real tasks. So here is the dilemma…

To which the answer was:

We got damn lucky. Four days earlier, Alpaca had come out, which essentially generated a near-human instruction dataset using GPT-3. We decided to do exactly the same, but in Russian and with gpt-3.5-turbo instead of text-davinci-003 to make the process cheaper.

No sooner said than done! The data was collected quickly, in 4 days. Some more time went into optimizing the API calls and growing the set to 30 thousand examples. After that, we started fine-tuning different base models: rugpt, xglm, and mt0_xxl. The results were acceptable, but nothing special.

RuGPT-large. The last question is there on purpose: it is guaranteed not to have been in the training sample because of ChatGPT's censorship.

We posted the models on HuggingFace as soon as they were ready, and people started asking us about fine-tuning LLaMA. I had a pretty strong opinion on that: I had read the original paper and seen that in almost every part of the training corpus only English was kept, with the other languages filtered out. The only exception was Wikipedia, for which 20 languages with Latin or Cyrillic scripts, including Russian, were left in.

You would think the past couple of years would have taught me to stop predicting anything.

Needless to say, I was wrong.

And it worked! It worked better than all the Russian base models that existed at the time.

At this point, Saiga’s story began in earnest.

Basics

A language model is a model that predicts the next word from the previous ones. Or the next character, or the next token, that is, a piece of a word. Up until around 2017, the best-known language models were N-gram models, which simply stored which words follow which and computed probabilities from those counts.
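To make that concrete, here is a minimal bigram sketch (the simplest N-gram model); the toy corpus and function name are, of course, purely illustrative:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which and turn the counts into probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in counts.items()
    }

corpus = ["the grass is green", "the sky is blue"]
model = train_bigram(corpus)
print(model["is"])  # {'green': 0.5, 'blue': 0.5}
```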

Haven't read Lena Vojta's course? Seriously, you haven't? Well, go read it!

There were also language models based on recurrent networks, and yours truly used them even back then, but in practical applications N-gram models still mattered more. They were used in spell checkers and ASR to take context into account.

Of course, one could say that word2vec from 2013 is also a kind of language model, but nobody really put it that way back then.

Transformers appeared in 2017, GPT and BERT in 2018, and language models became the foundation of the entire field of natural language processing. At the end of 2022 ChatGPT was released, language models reached the general public, and corporations and scientists began trying to replicate it.

Model

There were several candidate base models. First, the ancient small Sberbank models: rugpt3medium and rugpt3large, which borrow their names from GPT-3. Second, the multilingual models: xglm and mt0. But then, in February, LLaMA came out. And it was damn special.

Remember the scaling laws that are also described here? LLaMA relies on the Chinchilla laws, Hoffmann's laws, DeepMind's laws, the laws saying that models need to be fed a lot of tokens to become good. And at the time of its release it was unique in this respect among large open language models.

And here is the picture again

LLaMA turned out to have another very important property: tokenization that suits Russian well. This is surprising, given that the only Russian in its dataset was Wikipedia, which made up less than 1% of the data. I still don't really understand whether such tokenization was a coincidence or someone deliberately designed it. There is still no open multilingual model with better tokenization than LLaMA's. Except RWKV-4 World, of course, but that one is still exotic.
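This is easy to check yourself; a small sketch along these lines, where the checkpoint IDs are just examples of publicly available mirrors, and fewer tokens per Russian sentence generally means a friendlier tokenizer:

```python
from transformers import AutoTokenizer

text = "Почему трава зелёная?"

# Example checkpoints only; compare any LLaMA-family tokenizer with a non-LLaMA one.
for model_id in ["huggyllama/llama-7b", "facebook/xglm-4.5B"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    pieces = tok.tokenize(text)
    print(model_id, len(pieces), pieces)
```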

Synthetic datasets

In December 2022, the self-instruct paper was published. It too is unique in its own way: the authors did not need GPUs for it. They manually compiled 175 seed instructions with examples of their execution and asked GPT-3 to generate similar ones. Then they fine-tuned GPT-3 itself on these instructions via the API.

The resulting model was pretty close to the level of InstructGPT, which was trained on instructions written by hundreds of people and examples of their execution.

Grabbing my own pigtail, I pulled upward with all my might and, without much difficulty, hauled out of the swamp both myself and my horse, which I gripped tightly between my legs like a pair of tongs.

In March 2023, Alpaca appeared: a Stanford project with the same idea as self-instruct, but fine-tuning not GPT-3 but LLaMA. In effect, the authors pulled the answer style out of GPT-3 and transferred it to an open base model.

Distilling via API is brilliant!

And we did the same, but for Russian. We made it cheaper by using gpt-3.5-turbo, which was already available at the time. We also ran each task as a separate API call. It turned out something like this:

From the DataFest presentation
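Roughly, the collection loop looked like the sketch below, here written against the current openai Python client; the prompts, temperature, and file name are illustrative rather than the actual rulm generation code:

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Toy seed tasks; the real pipeline generated new instructions self-instruct style.
seed_tasks = [
    "Напиши короткое стихотворение про осень.",
    "Объясни, почему небо голубое.",
]

records = []
for task in seed_tasks:  # one separate API call per task
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": task}],
        temperature=1.0,
    )
    records.append({"instruction": task, "output": resp.choices[0].message.content})

with open("alpaca_ru.jsonl", "w") as f:  # placeholder file name
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
```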

LoRA: Low-Rank Adaptation

A couple of years ago, fine-tuning even a 7-billion-parameter model seemed like a daunting task, mainly because of the VRAM requirements. The heuristic is this: for 16-bit models, multiply the number of billions of parameters by 2, and you get roughly the number of gigabytes the model occupies. For 7 billion, that is 14GB. But besides the model itself you also need to store its gradients and the optimizer state, and that blows past the typical 24GB of VRAM.
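A rough back-of-the-envelope estimate (assuming full fine-tuning with Adam: fp16 weights and gradients plus two fp32 optimizer states per parameter) shows just how far past 24GB this goes:

```python
def full_finetune_gb(params_billion: float) -> float:
    """Very rough estimate: 2 bytes weights + 2 bytes grads + 8 bytes Adam states per parameter."""
    bytes_per_param = 2 + 2 + 8
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(round(full_finetune_gb(7)))  # ~78 GB, far beyond a single 24GB card
```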

With the advent of adapters, everything changed. It turned out that the original model does not need to be touched at all: you can attach new weights and train only them. Or train only small pieces of the original model, for example only the biases. Or learn virtual tokens. Or not just attach new weights, but put them in parallel with the old ones, reducing the dimensionality by factoring them into a product of two low-rank matrices.

Actually, the last option is called LoRA, and the whole set of methods is called PEFT, parameter-efficient fine-tuning.

This is the whole point of LoRA. Seriously, everything.

In our context, LoRA saves a lot of VRAM, because the original model is frozen and there is no need to store gradients or optimizer states for its weights. LoRA alone lets you train 7-billion-parameter models in 16 bits on cards with 24GB of VRAM.
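In code it looks roughly like this: a sketch with the peft library, where the checkpoint name, target modules, and hyperparameters are illustrative rather than our exact training config:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Example checkpoint only.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to wrap
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a fraction of a percent of the full model
```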

Quantization

But even with LoRA, training a 13-billion-parameter model does not fit into 24GB of VRAM, yet we have done exactly that many times. How?

The thing is, 16 bits of precision is overkill. Most weights of a trained model are roughly normally distributed within their layers, usually close to zero (see Appendix F), and we do not need to represent numbers like 65504. int8 instead of float16 is more than enough.

In August 2022, a paper called LLM.int8 came out. In it the authors, notably Tim Dettmers, took a careful look at the problems of quantization by rounding to the nearest integer. It turns out that the activations (not the weights!) of the network contain outliers that break this kind of quantization. So the authors built a workaround for those outliers. About the nature of the outliers, by the way, there were interesting debates not long ago.

Very clear diagram from the article
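To see why those outliers matter, here is a toy illustration of plain absmax round-to-nearest quantization (not the actual LLM.int8 scheme, just the naive baseline it fixes); the numbers are made up:

```python
import numpy as np

def absmax_int8(x):
    """Scale by the largest absolute value, round to the nearest int8."""
    scale = 127.0 / np.abs(x).max()
    return np.round(x * scale).astype(np.int8), scale

x = np.array([0.01, -0.02, 0.03, 0.015])
xq, s = absmax_int8(x)
print(np.abs(x - xq / s).max())      # tiny error without an outlier

x_out = np.append(x, 60.0)           # one outlier activation
xq, s = absmax_int8(x_out)
print(np.abs(x_out - xq / s).max())  # the small values are all crushed to zero
```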

But more importantly, unlike the authors of the hundreds of other quantization papers, they integrated their method into HuggingFace Transformers, the well-known library, which let us use it with a single line. That is, we freeze the model in 8 bits and train a 16-bit LoRA on top.
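That "one line" is essentially the 8-bit flag when loading the model. A sketch of the combination, assuming current transformers/peft APIs, with the checkpoint name again just an example:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",                                # example checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)             # freeze int8 weights, cast norms, etc.
model = get_peft_model(
    model,
    LoraConfig(r=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
```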

Now the heuristic is even simpler: the number of billions of parameters roughly equals the number of gigabytes. And 13-billion-parameter models train in 24GB with room to spare.

And this is just the tip of the iceberg: there is GPTQ, there is QLoRA, and there are the hundreds of papers on different quantization schemes mentioned above.

Run on CPU

Another thing that was unimaginable a couple of years ago: running a 13-billion-parameter model on an ordinary laptop, and before long even on a phone.

There is a man named Georgi Gerganov. He wrote a C library for language model inference, ggml (gg being, obviously, his initials), along with a model serialization format of the same name (which recently became gguf). On top of it he built llama.cpp, a library specialized for LLaMA inference.

Models from HuggingFace can be converted into this format and run on a CPU with decent inference speed, which we actively exploit. Two of our three live demos use ggml models.

ggml also has its own quantization methods, going down to a rather dubious 2 bits.
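Running such a converted model from Python takes a few lines, for example via the llama-cpp-python bindings; the gguf file name below is a placeholder for whatever you converted or downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path to a converted and quantized checkpoint; placeholder name.
llm = Llama(model_path="saiga-13b-q4.gguf", n_ctx=2048)

out = llm("Почему трава зелёная?", max_tokens=256)
print(out["choices"][0]["text"])
```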

Results

There are two main evaluations: a side-by-side comparison on 176 tasks and the Russian SuperGLUE.

Instead of a thousand words

Side-by-side results against ChatGPT-3.5; the numbers mean wins-draws-losses:

  • gigasaiga vs gpt3.5-turbo: 41-4-131

  • saiga2_7b vs gpt3.5-turbo: 53-7-116

  • saiga7b vs gpt3.5-turbo: 58-6-112

  • saiga13b vs gpt3.5-turbo: 63-10-103

  • saiga30b vs gpt3.5-turbo: 67-6-103

  • saiga2_13b vs gpt3.5-turbo: 70-11-95

  • saiga2_70b vs gpt3.5-turbo: 91-10-75

The comparisons were done in Toloka; each pair of answers was judged by 5 annotators.

On RSG, the base LLaMA models are no different from Saiga if you fine-tune them on the benchmark. LLaMA-2 13B itself sits in 4th place, behind humans, an ensemble, and Fred. You can also look at zero-shot and few-shot results: ChatGPT's final zero-shot score is 68.2%, while the 70-billion-parameter Saiga gets 64.3%.

In short, the 70-billion-parameter Saiga is roughly at the ChatGPT-3.5 level; it is just not very convenient to run, even quantized.

There are several demos where you can try the models. I pay for them out of my own pocket, so most of them run on a CPU:

Old demo: "why is the grass green?"

Old demo: extracting json

Instructions for running and training on your own machines can be found in the repository and the model cards.

The present day

A lot has changed since the first version:

  • The model is now called Saiga

  • LLaMA moved on to its second version

  • Many different datasets have been added: from me, from OpenAssistant, from good people

  • The ggml developers changed the model format at least three times

The essence stays the same: we teach a large base model to respond in Russian and release it publicly.

There are now also models from Yandex and Sber: YandexGPT and GigaChat. The former got its second version just today. By our measurements, which perhaps should not be trusted at all, GigaChat is better than YandexGPT and roughly at ChatGPT's level. It is better at the very least because it does not censor a significant share of requests. Neither of them is open; only Sberbank's 13-billion-parameter base model is available.

All our models and all links can be found in the repository: IlyaGusev/rulm.

Talk from DataFest: video, slides. I will not link a Telegram channel, I do not have one.
