Which models will cope with your tasks?
In recent years, large language models (LLMs) have become an important part of AI-based business solutions used for text generation and data analysis. However, most developments are focused on English-language projects, which creates difficulties for companies working with Russian-language data.
Ready-made LLMs for the Russian language often show low accuracy and limited capabilities. Privacy concerns are also forcing companies to opt for local models.
Our company has worked in artificial intelligence for a long time and began receiving similar requests from clients: build an AI solution that processes data locally. We asked ourselves which LLMs are suited to such solutions and what we could offer the customer. All of this grew into a large study of different language models.
In this article we look at which LLMs are suitable for Russian-language tasks, test them against various parameters, and identify the leaders. We evaluated text generation, question answering, error correction, and other capabilities.
Research objectives
The purpose of this study is to evaluate several large language models (LLMs) that can be used for projects related to the Russian language, and to select the optimal solution for local work with documents and texts.
We selected the following aspects of working with text and data to identify the best LLM:
Text generation:
Generating coherent text
Answering general knowledge questions
Answering questions in dialogue format
Correcting grammatical errors
Text structuring:
Brief retelling of a text
Answering questions about a text
Extracting structured data
Writing SQL queries
For an LLM to be useful in practice, the model must not only generate text well but also cope with various types of queries, from correcting grammar to composing complex SQL queries. The ability to write SQL queries is especially important: it lets the LLM retrieve the necessary information from a database, since the customer does not always supply documents as input data.
Therefore, we tested each model against these tasks and assessed how ready it was for use in real business problems.
What models did we test?
As part of the study, we tested six of what we consider the most promising large language models (LLMs) available for working with the Russian language, each with its own distinctive features.
OpenChat 3.5 is an open-source multilingual model trained on a variety of data from various languages, including Russian. The model is designed to process queries, generate text, and perform other natural language tasks.
YandexGPT — a development by Yandex, optimized for working with the Russian language. The model is available through a paid API, which limits its use in local systems.
GigaChat – a model developed by Sberbank for multilingual support, including Russian. It is focused on text processing and content generation.
Mistral — a model from the French startup of the same name. At the end of September 2023, it was the best LLM with a size of 7 billion parameters. It is used for a variety of purposes, from writing code to generating content in a variety of languages, including Russian.
Saiga-Mistral-7b-Lora is a version of the Mistral model, additionally trained on a Russian dataset using LoRA (Low-Rank Adaptation) technology. The model is specially adapted for tasks in Russian.
Saiga-Llama3-8b — a version of the Llama3 model, additionally trained on a Russian dataset. This is a powerful model capable of performing text-related tasks of varying complexity.
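To give a concrete sense of what running one of these models locally involves, below is a minimal sketch that loads a Saiga-class model through the Hugging Face transformers library. The model identifier, the use of the chat template, and the generation settings are illustrative assumptions, not the exact setup used in the study.

    # Sketch: loading a local instruction-tuned model (identifier and settings are assumptions).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "IlyaGusev/saiga_llama3_8b"  # assumed Hugging Face identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,   # half precision to fit a single GPU
        device_map="auto",
    )

    def generate(prompt: str, max_new_tokens: int = 512) -> str:
        # Wrap the user prompt in the model's chat template and return only the new tokens.
        messages = [{"role": "user", "content": prompt}]
        inputs = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
        return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

The cloud models (YandexGPT and GigaChat) were accessed through their APIs instead, so a sketch like this applies only to the locally deployable models.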
Research methodology
All tested models were evaluated under the same conditions on a wide range of tasks to objectively assess their abilities.
Scenarios
A unique scenario was developed for each task, some of which were based on real business cases. This made it possible to simulate situations that companies encounter in their daily work with documents in Russian.
Text generation: we asked the models to create text on a given topic, for example, a description of a tourist’s day on Mars or a formal order for a law firm. We assessed the coherence of the text, its grammar, and the absence of English insertions.
Answers to questions: The models were asked simple questions (for example, “Who wrote Romeo and Juliet?” or “What is the theory of relativity?”). The accuracy and brevity of the answers were assessed.
Answers in dialogue: here, the models' ability to retain the context of a conversation was tested when multiple related questions were asked sequentially. For example: “How much does the Moon weigh?” – “What about Mars?” – “And in pounds?” – “What is the distance between them?”
Error correction: the models were given texts containing errors, which they had to correct while preserving the original meaning and structure of the text.
Brief retelling of the text: We asked the models to retell long texts (both literary and legal) in a condensed form, preserving their main essence.
Questions about the text: we asked the models to answer three factual questions about different texts, a literary excerpt and a legal document.
Data extraction: The models were presented with legal text from which they had to extract data such as names, dates, amounts, document types and present it in JSON format.
Writing SQL queries: the models had to generate SQL queries based on a database description and the query conditions. The correctness of the syntax and the logic of the query were assessed.
Evaluation criteria
The following evaluation criteria were developed for each task:
Response accuracy – how correctly the model performed the task.
Coherence of the text – how logical and structured the text created by the model was.
Clarity of presentation – how clear the answer is to a person.
Grammatical correctness – checking for grammatical and stylistic errors.
Creativity – the extent to which the model shows originality in its answers.
Summarization skill—how well the model is able to compose a summary based on text.
Communication in Russian – to what extent the model can understand the context in Russian.
Each criterion was scored from 0 to 5, where 5 is the highest score that a model can receive if it perfectly performs the task assigned to it.
We also assessed overall task completion, i.e. how well the models coped with the assigned tasks: 0 – the model did not cope with the task at all, 1 – it coped, but with significant flaws, 2 – it coped almost perfectly.
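As a minimal sketch of how such a two-level rubric (0 to 5 per criterion, 0 to 2 per task) can be recorded and totalled, consider the snippet below. The criterion and task names echo this section; the numbers are invented for illustration, since the real scores came from manual review.

    # Sketch: recording rubric scores (0-5 per criterion, 0-2 per task) and totalling them.
    CRITERIA = ["accuracy", "coherence", "clarity", "grammar",
                "creativity", "summarization", "russian"]

    # Hypothetical scores for one model on one task.
    scores = {
        "text_generation": {
            "criteria": {"accuracy": 5, "coherence": 5, "clarity": 4, "grammar": 5,
                         "creativity": 4, "summarization": 0, "russian": 5},
            "task_score": 2,
        },
    }

    def totals(model_scores: dict) -> tuple[int, int]:
        # Sum criterion points and task-completion points across all tasks.
        criteria_total = sum(sum(task["criteria"].values()) for task in model_scores.values())
        task_total = sum(task["task_score"] for task in model_scores.values())
        return criteria_total, task_total

    print(totals(scores))  # (28, 2) for this illustrative entry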
Test environment
All testing was carried out under the same conditions, using Google Colab with fixed allocated resources.
This approach made it possible to provide equal conditions for each model and evaluate their performance on the same equipment.
However, if a model did not cope with the task on the first attempt, the prompt was edited to obtain a more accurate result. This also made it possible to test the models' flexibility and adaptability to changes in the request.
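The sketch below shows roughly what this setup implies in code: check which GPU Colab has allocated, then re-issue an edited prompt if the first answer misses the mark. The acceptance check is a placeholder; in the study that judgement was made manually.

    # Sketch: checking the allocated Colab GPU and retrying a task with an edited prompt.
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print("GPU:", props.name, "| VRAM, GB:", round(props.total_memory / 1024**3, 1))

    def ask_with_retry(generate, prompt_variants: list[str], is_acceptable) -> str:
        # Try the original prompt first; if the reply is judged unacceptable,
        # move on to the next, manually edited wording.
        answer = ""
        for prompt in prompt_variants:
            answer = generate(prompt)
            if is_acceptable(answer):
                break
        return answer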
Test results
Since each model was tested for a specific task, it would be logical to consider all 6 models in the context of each task.
1. Generating coherent text
Task: the models had to generate coherent text on a given topic. Here we wanted to test the quality of the generated text: grammar, the presence of English words or letters, coherence, narrative style, and relevance to the topic.
Two prompts were used: the first asked for a creative text on a given topic, and the second asked the model to draw up a formal legal document.
YandexGPT: Showed high coherence and grammatical correctness of the text. The text was logical, stylistically correct and without errors. The model did not use insertions of English words, which made its result almost ideal for Russian-language projects (2 points for completion).
Saiga-Mistral-7b-Lora: It also showed excellent results, generating high quality text with good structure and minimal errors. The text was creative and fully consistent with the given topic (2 points for completion).
OpenChat3.5: The results were satisfactory, but there were insertions of English words and shortcomings in the structure of the text. The text could be coherent, but did not always match the style or context (1 point for completion).
GigaChat: The model showed good results. The text was less structured and there were sentence agreement errors, but the model still deserved a high score (2 points for completion).
Mistral: Text generation was not bad, but in some cases the model made syntax errors and did not always choose the right style (2 points for completion).
Saiga-Llama3-8b: The text was grammatically correct, but less coherent than the leaders'. Occasionally there were minor inconsistencies in style (1 point for completion).
The best models for generating coherent text turned out to be YandexGPT and Saiga-Mistral-7b-Lora: both provided a high level of grammatical accuracy and stylistic consistency.
2. Answering general knowledge questions
Task: the models had to answer general knowledge questions concisely and accurately. We assessed the accuracy of the answers and the coherence of the text.
YandexGPT: Showed excellent results. The model answered briefly, accurately and without errors. The correct response style was maintained without unnecessary deviations (2 points for completion).
Saiga-Mistral-7b-Lora: Did a good job of giving accurate answers, but sometimes the answers were a little longer than required (2 points for completion).
OpenChat3.5: The model often repeated the question in the answer, which made the answers less effective. There were shortcomings in brevity and accuracy (1 point for completion).
GigaChat: The answers were good, but sometimes the model deviated from the topic or gave overly detailed answers (2 points for completion).
Mistral: The answers were accurate, but not always concise. The model sometimes provided additional information that was not requested (1 point for completion).
Saiga-Llama3-8b: On the whole, the model coped with the task and gave good answers (2 points for completion).
The best models for answering questions accurately and concisely were YandexGPT and Saiga-Llama3-8b.
3. Answers to questions in dialogue format
Task: the models had to maintain context and continue answering questions in a dialogue format. We asked the questions over several turns because we wanted to test how well each model holds context. The accuracy of the answers and the coherence of the text were assessed.
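As a minimal sketch of how such a multi-turn exchange can be fed to a chat model, the snippet below appends every new question to a running message history, so the model sees the whole conversation on each turn. It reuses the model, tokenizer and generation settings from the loading sketch above and is an assumption about the harness, not the study's exact code.

    # Sketch: preserving dialogue context by resending the full message history each turn.
    history = []

    def chat_turn(question: str, max_new_tokens: int = 256) -> str:
        history.append({"role": "user", "content": question})
        inputs = tokenizer.apply_chat_template(
            history, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
        answer = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
        history.append({"role": "assistant", "content": answer})
        return answer

    # The sequence of related questions from the test (translated):
    for q in ["How much does the Moon weigh?", "What about Mars?",
              "And in pounds?", "What is the distance between them?"]:
        print(q, "->", chat_turn(q))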
YandexGPT: Did an excellent job. The model maintained the context throughout the entire dialogue and answered questions clearly and consistently, without losing the thread of the conversation (2 points for completion).
Saiga-Mistral-7b-Lora: Showed good results by maintaining context and answering questions consistently. Only in rare cases did the answers deviate slightly from the main topic (1 point for completion).
OpenChat3.5: The model did not handle dialogue well, sometimes losing context, especially with a long sequence of questions (0 points for completion).
GigaChat: The results were average. The model sometimes lost context and produced answers inconsistent with previous questions (1 point for completion).
Mistral: The model does not retain context when questions are asked sequentially, although it does answer questions posed within a single prompt (0 points for completion).
Saiga-Llama3-8b: The model lost context during long dialogues and made errors in the sequence of answers (0 points for completion).
The best models for dialogue turned out to be YandexGPT, GigaChat and Saiga-Mistral-7b-Lora, due to their ability to accurately maintain context.
4. Correcting grammatical errors
Task: Correct grammatical errors in the text while maintaining the original meaning. Compliance with the grammatical norms of the Russian language was assessed.
YandexGPT: The model performed almost perfectly: the errors were corrected and the text became fully correct, but English words crept in (1 point for completion).
Saiga-Mistral-7b-Lora: Showed good results. The model successfully corrected errors and produced grammatically correct text (2 points for completion).
OpenChat3.5: The model corrected the text, but left many errors and shortcomings, so we concluded that it failed (0 points for completion).
GigaChat: The results were almost perfect – all errors were corrected (2 points for completion).
Mistral: Corrected some of the errors, but not all corrections were right; we concluded that the model failed (0 points for completion).
Saiga-Llama3-8b: Coped with the task, but sometimes left errors in the text (2 points for completion).
GigaChat, Saiga-Mistral-7b-Lora and Saiga-Llama3-8b showed the best results in correcting grammatical errors.
5. Brief retelling of the text
Task: the models had to retell a long text while preserving its main point. Here we assessed the coherence of the text and whether the model captured its essence. The models were offered an excerpt of almost a thousand characters from “The Idiot” by F. M. Dostoevsky.
As a second prompt, we offered text from a legal document.
Saiga-Mistral-7b-Lora: Showed excellent results in summarizing the text, keeping the main points and making the text concise (2 points for completion).
Saiga-Llama3-8b: Coped well with the task, condensing the text but sometimes missing important details (1 point for completion).
YandexGPT: The retellings were accurate, but sometimes turned out a little longer than necessary (2 points for completion).
OpenChat3.5: The model generally retold the text, but sometimes missed important points or made the text less coherent and readable (0 points for completion).
GigaChat: The results were average: the retelling was either too short or lost key elements of the text (1 point for completion).
Mistral: The retelling was not precise enough, and the text could lose important aspects (1 point for completion).
The leaders in this category were YandexGPT, Saiga-Mistral-7b-Lora and Saiga-Llama3-8b.
6. Questions about the text
Task: the models, after reading the given text, had to answer questions related to its content. The questions focused on key details, facts, or conclusions presented in the text. It was important to evaluate how accurately and concisely the model could extract the required information and answer the question while maintaining context.
We offered the models the same texts that we used for the brief retelling and asked three factual questions about each.
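In prompt terms, this task boils down to placing the source text and the question into a single request. A rough sketch of that pattern is shown below; it assumes the generate helper from the loading example, and the wording is illustrative (the study's prompts were in Russian), not the actual prompt used.

    # Sketch: answering a factual question about a supplied text.
    QA_TEMPLATE = (
        "Below is a text. Answer the question briefly, relying only on this text.\n\n"
        "Text:\n{document}\n\nQuestion: {question}\nAnswer:"
    )

    def answer_from_text(generate, document: str, question: str) -> str:
        return generate(QA_TEMPLATE.format(document=document, question=question))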
YandexGPT: Showed excellent results. The model gave accurate and concise answers to questions about the text, clearly following the context and maintaining logic (2 points for completion).
Saiga-Mistral-7b-Lora: Good at responding, maintaining context and providing accurate answers, but sometimes tends to be overly detailed (2 points for completion).
Saiga-Llama3-8b: Quite accurate in answering questions, but sometimes misses details or gives less clear answers than the leaders (2 points for completion).
OpenChat3.5: The model answers the questions, but with shortcomings such as awkward wording and style (1 point for completion).
GigaChat: excellent results (2 points for completion).
Mistral: Showed poor results. It often gave inaccurate or incomplete answers, especially to the more complex questions about the text (1 point for completion).
The best results in solving this problem were shown by YandexGPT. The model demonstrated high accuracy and brevity of answers, confidently retained context, and coped with both simple and more complex questions. Saiga-Mistral-7b-Lora also showed good results, but sometimes gave overly detailed answers, which could be unnecessary.
7. Extract structured data
Objective: Extract key data (names, dates, amounts, etc.) from legal text and present it in JSON format.
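A sketch of what such an extraction call and its validation might look like is given below. The field names, the English prompt wording and the validation step are assumptions for illustration; the study only specified that names, dates, amounts and document types had to come back as JSON.

    # Sketch: asking a model for structured JSON and validating the reply.
    import json

    EXTRACTION_PROMPT = (
        "Extract the parties' names, the dates, the amounts and the document type from the text "
        "below. Return only valid JSON with the keys: names, dates, amounts, document_type.\n\n{text}"
    )  # in the study the prompts were in Russian

    EXPECTED_KEYS = {"names", "dates", "amounts", "document_type"}

    def extract_fields(generate, legal_text: str) -> dict:
        raw = generate(EXTRACTION_PROMPT.format(text=legal_text))
        data = json.loads(raw)                      # fails if the model wrapped the JSON in prose
        missing = EXPECTED_KEYS - data.keys()
        if missing:
            raise ValueError(f"model omitted fields: {missing}")
        return data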
Saiga-Llama3-8b: The model performed well, but did not extract all the data (1 point for completion).
YandexGPT: Also did a good job of extracting data, but could sometimes make mistakes with complex queries (1 point for completion).
Saiga-Mistral-7b-Lora: Showed good results, but in difficult cases could miss important details (1 point for completion).
OpenChat3.5: The model did not cope with the task at all (0 points for completion).
GigaChat: The results were perfect, the model always extracted the data correctly (2 points for completion).
Mistral: The model performed worse than the others, often losing data or returning it in an incorrect form (0 points for completion).
GigaChat has become the best model for extracting structured data.
8. Writing SQL queries
Task: Generating SQL queries based on a text description of the database and query conditions.
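The sketch below illustrates the prompt pattern for this task: the schema is described in plain text and the model is asked for a single query, which is then smoke-tested against an in-memory database. The schema, the condition and the prompt wording are invented for illustration and are not the study's actual test cases.

    # Sketch: generating an SQL query from a schema description and smoke-testing it.
    import sqlite3

    SCHEMA = "orders(id INTEGER, customer TEXT, total REAL, created_at TEXT)"
    CONDITION = "total revenue per customer in 2024, highest first"

    SQL_PROMPT = (
        f"Database schema: {SCHEMA}\n"
        f"Write one SQL query that returns: {CONDITION}.\n"
        "Return only the SQL, without explanations."
    )

    def check_query(generate) -> str:
        query = generate(SQL_PROMPT).strip().strip("`")
        con = sqlite3.connect(":memory:")
        con.execute(f"CREATE TABLE {SCHEMA}")
        con.execute(query)   # raises sqlite3.OperationalError if the syntax is wrong
        return query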
YandexGPT: Demonstrated excellent results, generating SQL queries with minimal errors. The model understood the structure of the database and could write logically correct queries (2 points for completion).
Saiga-Mistral-7b-Lora: Also showed good results, but sometimes made minor errors in syntax (2 points for completion).
OpenChat3.5: The model could not cope with the task, even if we started communicating in English (0 points for completion).
GigaChat: The model solved the problem, the results were good (2 points for completion).
Mistral: The SQL queries were incorrect, the model did not solve the problem (0 points for completion).
Saiga-Llama3-8b: The model coped with the task (2 points for completion).
YandexGPT, GigaChat, Saiga-Llama3-8b and Saiga-Mistral-7b-Lora were best at writing SQL queries.
Conclusions
The test results showed that each of the tested models has its own unique advantages and disadvantages. Depending on the specifics of the problem, one model may be more suitable than another.
We tested 4 LLMs that can be deployed locally and 2 cloud solutions (YandexGPT and GigaChat) on a number of synthetic tasks that probe the models' capabilities from different angles: generating new texts, and working with existing texts, their analysis and systematization.
Criteria-based assessment
We have summarized the assessments for each criterion and each task in a common table for clarity. The rating for each criterion can range from 0 to 5. Below is a table with the criteria by which we evaluated the models (YandexGPT's responses were taken as the reference answers):
Both cloud solutions coped well with the synthetic tasks, and only 2 of the 4 local models showed comparable results: Saiga-Mistral-7b-Lora and Saiga-Llama3-8b.
Task-based assessment
Below is a table with the results for all models, showing how well they coped with the tasks: 0 – the model did not cope with the task at all, 1 – it coped, but with significant flaws, 2 – it coped almost perfectly.
Based on the results of this table, Saiga-Mistral-7b-Lora is the best choice as a local language model.
Conclusion
Based on two evaluation options, we conclude that the best solutions for working with Russian-language documents are the YandexGPT and GigaChat cloud platforms.
But in situations where you need to use a local language model, Saiga-Mistral-7b-Lora is worth considering.
The YandexGPT and Saiga-Mistral-7b-Lora models showed the best results in most tasks related to text generation, dialogs, and error correction.
Saiga-Llama3-8b has become the best choice for data extraction and document analysis tasks, making it an excellent tool for automating document processing.
In the future, with the development of NLP technologies, we can expect the emergence of even more accurate and productive language models that will be able to solve problems at an even higher level.
In addition, the possibility of additional training of models on specialized datasets (as was done with Saiga-Mistral-7b-Lora and Saiga-Llama3-8b) will allow them to be adapted to the specific needs of companies working with the Russian language.