Irbis-7B, or how we taught an LLM the Kazakh language

Start

Language models based on the Transformer architecture, such as Llama, Mistral, and others, show impressive results in English. Their effectiveness in other languages, including Kazakh, can suffer, however. Further training on a specific domain, even with a good dataset, may not yield a significant gain in quality. And the problem is not so much that the base model saw little Kazakh text during training, but rather inefficient tokenization. Because of this shortcoming, models cannot use their full potential in languages other than English.

Let's take Mistral as an example (don't worry, it uses the same BPE tokenizer with a 32k-token vocabulary) together with Gemma (whose vocabulary is adapted to work with different languages and is therefore 8 times larger) and try splitting some text into tokens. A token, by the way, is the minimal meaningful unit of text (it can be a word, part of a word, or even a single morpheme) that a transformer processes. Here is what we get:

The text in the example above is "Manyzdy halykka aser etken sayasi sheshimdi anyktanyz", which translates as "Identify a political decision that affects a significant part of the population".

While the same text in English would take about 10 tokens, in Kazakh it takes 2-3 times as many. This leads not only to poor training efficiency, but also to a low generation speed (since identical models predict tokens at the same rate, the one whose tokenizer splits a word into several tokens needs more time to produce the same text than one that manages with 1-2 tokens per word). And don't forget about the context window, which fills up much faster if this problem is not fixed.
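If you want to reproduce this comparison, a minimal sketch with the HuggingFace tokenizers could look like the snippet below; the checkpoint names refer to the public Mistral-7B and Gemma repositories (Gemma is gated, so access has to be granted on the Hub first), and the sentence is the Cyrillic form of the example above.

# Count how many tokens each tokenizer needs for the same Kazakh sentence.
from transformers import AutoTokenizer

text = "Маңызды халыққа әсер еткен саяси шешімді анықтаңыз."

for name in ["mistralai/Mistral-7B-v0.1", "google/gemma-7b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
    print(tok.convert_ids_to_tokens(ids))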

On the one hand, Gemma's tokenizer looks more attractive judging by the example above; on the other hand, considering that at each prediction step the model has to estimate a probability for every one of the 256k tokens in its vocabulary (versus 32k for Mistral), interest in it quickly fades as an unnecessary complication.

Tokenizer

Mistral's tokenizer vocabulary contains 32k tokens, mostly English lexemes (there are some Kazakh ones too, but very few). We decided to expand the vocabulary by no more than a factor of two. For this we collected about 20 GB of raw text, all kinds of news, articles, and so on (mostly in Kazakh, with 5-10 percent in Russian), chose the size of the new vocabulary (the same 32k), and trained the tokenizer. The results looked noticeably better even with such a small dataset:
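For reference, training such a vocabulary with the sentencepiece library (the same family of tokenizer that Mistral uses) can be sketched roughly like this; the corpus path and the options below are illustrative rather than our exact training configuration.

# Illustrative sketch: train a 32k BPE vocabulary on the raw Kazakh/Russian corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="kk_corpus.txt",           # one document or sentence per line
    model_prefix="kk_bpe_32k",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,       # keep rare Kazakh-specific Cyrillic letters
    input_sentence_size=10_000_000,  # subsample a very large corpus
    shuffle_input_sentence=True,
)

sp = spm.SentencePieceProcessor(model_file="kk_bpe_32k.model")
print(sp.encode("Маңызды халыққа әсер еткен саяси шешімді анықтаңыз.", out_type=str))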

The final tokenizer vocabulary (after merging the original one with our new one and removing duplicates) contains a little over 60k tokens; one possible way of doing such a merge is sketched right below. After that comes the difficult, long, and expensive stage: pre-training.
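A merge of this kind can be done at the level of the sentencepiece model proto, for example as in the sketch below; the file names are placeholders, and handling of piece scores and special tokens is omitted for brevity.

# Append every piece from the new Kazakh tokenizer that the base Mistral
# tokenizer does not already contain.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

base = sp_pb2.ModelProto()
with open("mistral_tokenizer.model", "rb") as f:
    base.ParseFromString(f.read())

kazakh = sp_pb2.ModelProto()
with open("kk_bpe_32k.model", "rb") as f:
    kazakh.ParseFromString(f.read())

existing = {p.piece for p in base.pieces}
added = 0
for p in kazakh.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0        # scores of added pieces are simply reset here
        base.pieces.append(new_piece)
        added += 1

with open("merged_tokenizer.model", "wb") as f:
    f.write(base.SerializeToString())
print(f"Added {added} pieces, merged vocabulary size: {len(base.pieces)}")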

Pre-train

Since the model is not familiar with our new tokenizer, and a vocabulary of this size is new to it, we need to re-initialize the layers that will "interact" with it: the input layer (embed_tokens, which turns the tokens from the screenshots above into vector representations the model understands) and the output layer (lm_head, which converts vectors into probabilities of generating a particular token). Right after this, the new layers will not be on good terms with what lies between them (the "body" of the transformer model itself) and will generate garbage made of random tokens instead of meaningful text, so the next step is to train them.
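In code, this re-initialization boils down to swapping in the merged tokenizer and resizing the embedding matrices; here is a minimal sketch, with a placeholder path standing in for the merged tokenizer.

# Load the base model, attach the merged tokenizer and resize embed_tokens /
# lm_head to the new vocabulary size. The newly added rows are freshly
# initialized and only become useful after pre-training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("irbis-tokenizer")  # placeholder path

model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)   # roughly (60k+, 4096)
print(model.get_output_embeddings().weight.shape)  # roughly (60k+, 4096)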

Given that the main goal at this stage is to teach the model to continue text intelligibly, we use the same general dataset of raw text that we trained the tokenizer on. It is small (only 2B tokens) and not the cleanest, but as an experiment for building a prototype model it should be enough.
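The pre-training step itself is ordinary causal language modelling on that raw corpus; a simplified outline with the HuggingFace Trainer is shown below, where the paths, sequence length, and hyperparameters are illustrative.

# Simplified outline of continued pre-training on the raw corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("irbis-tokenizer")        # placeholder
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset("text", data_files={"train": "kk_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="irbis-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()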

As for compute, we had 2xH100 at our disposal, which we mercilessly exploited for a week. The model trained smoothly, the losses (both train and eval) kept falling (showing in every way that 7 days was not enough for it), and the samples kept getting better. By the end, the eval loss stood at 1.75 (at the start it was as high as 32.46, since with the extended tokenizer vocabulary we were trying to force the LLM to write text in a language it barely knew). Here are some good and bad examples on the question-answering task (what our new model generated is highlighted in red):

Surak – question, Zhauap – answer

Although the results are still far from perfect, they are already much better than how Llama-2-70b or other open-source LLMs currently generate text in Kazakh. Such models either try to respond in English (to the extent that they understood the question) or simply throw out random Kazakh words:

And so we have come back to where we started, but now we have a model that more or less works with the Kazakh language on our new tokenizer. In other words, we have "grafted" a new language onto the model: Kazakh is now translated into vectors whose meaning is correct and understandable to the model, just as was always the case for its native English. The next step, as good form dictates, is fine-tuning.

Fine-tuning

To put it simply, this stage is needed to obtain more noticeable improvements in the model's quality (as a rule, on narrow tasks or specific data). If earlier we taught it to simply continue text, now we need to make it give a specific answer (including by using additional information supplied in the context; this is how we teach the model not to make things up on the fly but to rely on what is already there). For this you need a good dataset containing instructions (questions), contexts (potentially holding the information for an answer), and the answers themselves. Here the quantity of data matters less than its quality: for example, researchers at FAIR achieved success with only a thousand examples, although a lot of effort went into manual selection and filtering.

Mostly, such datasets are compiled in English; everything else is usually just a translation via DeepL or Google Translate. The Kazakh language is no exception, so we gathered what already exists and also translated several small datasets ourselves. As a result, our dataset contained translated Alpaca, Dolly, and many other small ones (including datasets for DPO, for example Belebele). After a little manual cleaning, we ended up with about 200k samples, for which we clearly defined a prompt template and placed special tokens (bos, eos, etc.) at its edges, so that the model learns to understand when an answer should end instead of continuing generation indefinitely.
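To give an idea of what such a template can look like, here is a hypothetical example; the section markers and wording are illustrative and not our exact template.

# Hypothetical prompt template: instruction, optional context and answer are
# laid out in one string framed by BOS/EOS tokens, so the model learns where
# a response should end.
def build_sample(instruction: str, answer: str, context: str = "",
                 bos: str = "<s>", eos: str = "</s>") -> str:
    context_block = f"\n### Context:\n{context}" if context else ""
    return (
        f"{bos}### Instruction:\n{instruction}"
        f"{context_block}"
        f"\n### Answer:\n{answer}{eos}"
    )

print(build_sample(
    "At what altitude does the International Space Station orbit the Earth?",
    "At an average altitude of 420 km.",
    context="The ISS orbits at an average altitude of about 420 km.",
))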

At this stage (as at the previous one) we used HuggingFace libraries such as transformers, peft, trl, and datasets. Besides them there were also deepspeed, bitsandbytes, and flash-attention, which were responsible for training efficiency and allowed us to get better results in the same amount of time. The pre-trained model was quantized to 4 bits and LoRA adapters were trained for all layers (thanks to QLoRA for this, and in particular for the fact that you can freeze the model, reduce its precision, and then restore the original weights on the fly through double dequantization); a condensed code sketch of this setup is given after the examples below. By the end of training, namely after about 12 hours, our model looked strikingly better than the pre-trained one (and the base Mistral-7B, of course). Over this time the eval loss fell from 3.39 to a final 0.74, and the model learned to extract information from a small context quite well:

Question: At what altitude does the International Space Station orbit the Earth?
Answer: an average altitude of 420 km

Question: What are the main advantages of 5G technology?
Answer: high data transfer speed

Question: What is the punishment for organizing an illegal gambling business?
Answer: two thousand monthly calculation indices
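Putting the pieces together, the fine-tuning setup described above can be sketched roughly as follows; the checkpoint name, dataset path, and hyperparameters are placeholders, and depending on the trl version the name of the text column may need to be passed explicitly (via dataset_text_field or an SFTConfig).

# Condensed sketch of the QLoRA setup: the pre-trained model is loaded in
# 4-bit NF4 with double quantization, a LoRA adapter is attached to all
# linear layers, and training runs through trl's SFTTrainer.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,       # the double (de)quantization trick
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "irbis-7b-pretrained",                # placeholder for our checkpoint
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",          # LoRA on all linear layers
    task_type="CAUSAL_LM",
)

# The dataset is expected to expose the formatted prompt in a "text" column.
dataset = load_dataset("json", data_files="irbis_sft.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()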

Unfortunately, other tasks (for example, summarization or paraphrasing) work worse (there were few good samples for them in the dataset, so for now the model can only extract simple information; there is definitely room to grow here, and we have already started preparing datasets), but it answers questions without context quite extensively and often correctly. Here are some examples:

Question: Why is the grass green?
Answer (translated): Plants convert sunlight into chemical energy through photosynthesis, which produces oxygen and carbon dioxide in plant cells. Carbon dioxide accumulates in plant leaves, turning them green.

Question: What to do on a Sunday evening?
Answer (translated): On a Sunday evening you can do several things, such as reading a book or watching a movie and talking with friends.

Question: How many days are there in a year?
Answer (KZ): 12 ai, 365 kunnen turada
Answer (translated): It consists of 12 months and 365 days.

Question: Who is the President of Kazakhstan?
Answer (translated): The President is Kassym-Jomart Tokayev.

Question: What is the meaning of life?
Answer (translated): For many people, the meaning of life is joy and happiness. This is not true, because many people are looking for what they think is happiness or what they can achieve. The meaning of life is the sum of our daily actions. We need to understand how we can shape our lives, how we can change them, and how they affect us.

Question (KZ): Tarbienin manyzdylygyn zhane onyn adam omirine aserin sipattanyz?
Question (translated): Describe the importance of education and its impact on a person's life.
Answer (KZ): Tarbiye – bul adamnyn omirinde manyzdy rol atkaratyn, olardyn zheke basyn kalyptastyratyn zhane olardyn aleuetine aser etetin process. Bul adamdarga ozderinin zheke tulgalary men kundylyktaryna nazar audaruga mumkindik beredi, olar oz omirlerinin mani men magynasy bar ekenin sezinuge komektesedi. Sonday-ak, ol adamdardy ozderin zhaksy tusinuge zhane basqalarga kalay areket etuge uiretetin dagdylardy damytuga komektesedi. Sonymen qatar, tarbie sonymen birge adamdar ushin kundy kundylyktardy kalyptastyruga zhane olardy kurdastaryn kurmetteuge shakyrady.
Answer (translated): Education is a process that plays an important role in a person's life, shaping their personality and influencing their potential. It allows people to focus on their personality and values, helping them feel that their lives have sense and meaning. It also helps people develop skills that help them better understand themselves and how to treat others. Moreover, education also encourages people to form valuable values and respect their peers.

In conclusion

The final model showed excellent potential for further training (metrics will be published later). It answers simple questions well and can work with context (albeit weakly for now), and thanks to the more efficient tokenizer, the speed of generating Kazakh text increased roughly 3-4 times compared to the base model, while the cost of generating responses went down. The effective capacity of the context window grew by the same factor: you can now feed several more pages of text into the model.

Our experience adapting a language model to the Kazakh language shows that improved tokenization and pre-training in the target language can significantly improve the model's quality and speed. The same approach can be applied to other languages that are underrepresented in popular open-source models.

Download

Development is led by the Gen2B team [link removed by moderator]

Authors:
– Denis: @bomze
– Igor: @seeyouall
– Armen: @armen
