Comparing 16 open-source models on a single prompt at different temperatures

So, welcome under the cut if you are interested in how to test sixteen LLMs at once, what results such a test shows, and where to look if you need something more rigorous.

Why open-source

If you are faced with the task of generating Russian-language text, you will most likely consider open-source models. Firstly, they are free; secondly, they run locally, so you can trust them with sensitive information.

You could also pick a model based on the results on Chatbot Arena, but you will have to hunt for open-source models there. Besides, its rating is built on the Elo system, which is not the most reliable metric: in chess, Elo will tell you that Magnus Carlsen is likely to beat Hikaru Nakamura, but it won't tell you which of them is better at fixing a faucet. There is also the Russian-language benchmark leaderboard MERA. If the number of tasks, domains and other toggle switches scares you, just take the top model from the list for your tasks.

There are open-source models at the top of these leaderboards, but, as a rule, they are heavy. To avoid manually trying out every model we are interested in, we will use a text-generation framework and analyze the results, having first settled on quality criteria.

What follows is a description of a simple version of an LLM test, for when you need a prototype as quickly as possible. If you have the time and inclination, at the end of the article I have collected links to more involved ways of adapting an LLM to your needs.

LLM test

Download the open-source model locally

This can be done quickly and painlessly using the Ollama framework.

Step-by-step instructions:
  1. Download the client for your OS.

  2. Go to the page of the model you are interested in from the list.

  3. From the drop-down list, select the number of parameters you need and the model's quantization type (you can read about quantization and the GGUF format here).

  4. Copy the command into the terminal and download the model (see the example below).
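For example, for the 3b llama model used later in the article, the terminal commands look like this (take the exact tag from the model's page):

ollama pull llama3.2:3b-instruct-q4_0
ollama run llama3.2:3b-instruct-q4_0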

Now you can talk to the model in the terminal, or you can bolt on a web interface.

As an alternative, you can use models from HuggingFace. This is not the most reliable option: you may be denied access to some models or hit errors about resource limits. Besides, approval takes time, and we want to try things right now.

Example for Llama 3.2 1b from HuggingFace:

import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
   task="text-generation",
   model=model_id,
   torch_dtype=torch.bfloat16,  # half-precision weights to save memory
   device_map="auto"            # put the model on GPU if one is available
)
messages = [
   {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
   {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
   messages,
   max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

GGUF models from HuggingFace can now also be used directly in the Ollama client.

Step by step instructions:
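For example, a public GGUF repository can be run straight from the Hub; substitute a real username and repository name (the placeholders here are just the command format):

ollama run hf.co/{username}/{repository}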

The arsenal is quite large; you can also upload your own pre-trained models.

List of assembled models for testing

We have collected the most popular open-source LLMs that can run on most reasonably modern laptops:

  • mistral-nemo:12b-instruct-2407-q4_0 (12b q4_0)

  • qwen2 (7b q4_0)

  • hermes3 (8b q4_0)

  • owl/t-lite:q4_0-instruct (8b q4_0)

  • llama3.1 (8b q4_0)

  • ilyagusev/saiga_llama3 (8b q8_0)

  • gemma2 (9b q4_0)

  • aya (8b q4_0)

  • phi3:14b (14b q4_0)

  • phi3.5:3.8b-mini-instruct-q4_0 (3.8b q4_0)

  • qwen2.5:7b-instruct-q4_0 (7b q4_0)

  • mistral-small:22b-instruct-2409-q4_0 (22b q4_0)

  • solar-pro:22b-preview-instruct-q4_0 (22b q4_0)

  • nemotron-mini:4b-instruct-q4_0 (4b q4_0)

  • glm4 (9b q4_0)

  • llama3.2:3b-instruct-q4_0 (3b q4_0)

The number of parameters in billions and the type of quantization are indicated in parentheses. I’ll show you what this means using the llama3.2:3b-instruct-q4_0 model as an example:

  • 3b: the number of model parameters (3 billion). The more parameters, the better the model's answers, but the more power it takes to run;

  • q4_0: 4-bit quantization with rounding to the nearest value. This scheme uses resources efficiently and provides sufficient accuracy on systems with limited processing power. Each weight is stored not as a full floating-point number but as four bits of information. Weights are grouped into blocks of 32, and each weight is reconstructed as w = q * block_scale, where q is the 4-bit quantized value of that weight and block_scale is the scale factor shared by the block.
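To make this concrete, here is a simplified numpy sketch of block quantization in the spirit of q4_0 (the actual llama.cpp storage layout and scale choice differ in details):

import numpy as np

def quantize_block(weights: np.ndarray):
    # One block = 32 weights sharing a single scale factor.
    assert weights.shape == (32,)
    block_scale = np.abs(weights).max() / 7   # 4-bit signed range is about [-8, 7]
    q = np.clip(np.round(weights / block_scale), -8, 7).astype(np.int8)
    return q, block_scale

def dequantize_block(q: np.ndarray, block_scale: float) -> np.ndarray:
    return q.astype(np.float32) * block_scale   # w = q * block_scale

block = np.random.randn(32).astype(np.float32)
q, block_scale = quantize_block(block)
print('max reconstruction error:', np.abs(block - dequantize_block(q, block_scale)).max())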

How we will test LLM

We will evaluate the models' ability to cope with text generation from a simple request. We will analyze how the quality of the generated Russian-language text changes with temperature for each of the models. This will give us an approximate understanding of which one is preferable as a baseline solution for subsequent fine-tuning and integration into our products.

In the context of LLMs, temperature is a parameter that controls the randomness of next-token selection. The higher it is, the more likely the model is to pick a less predictable, rarer word.
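A minimal numpy sketch of what the parameter does to the next-token distribution (the logits here are illustrative, not tied to any particular model):

import numpy as np

def next_token_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Dividing the logits by the temperature sharpens the distribution
    # (t < 1) or flattens it (t > 1) before the softmax.
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
for t in (0.1, 0.5, 0.9):
    print(t, np.round(next_token_probs(logits, t), 3))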

Manually cycling through a dozen and a half models is not the most rewarding task, so let's automate the experiment using the Autogen framework:

!pip install pyautogen

from autogen import AssistantAgent

'''
Config:
model - the name of your model from the `ollama list` output
base_url - the default url for the client
api_key - the default key for the client
cache_seed - disable caching so that identical requests don't return the same result
temperature - our creativity slider
price - with a local model, zero out the cost to get rid of nagging messages
'''

llm_config = {
   'model': 'mistral-nemo:12b-instruct-2407-q4_0',
   'base_url': 'http://localhost:11434/v1',
   'api_key': 'ollama',
   'cache_seed': None,
   'temperature': 0.2,
   'price': [0, 0]
}

agent = AssistantAgent(
   name="Агент",
   llm_config=llm_config,
   max_consecutive_auto_reply=None,
   # "You are a very creative author who writes pleasant, laconic greetings."
   system_message='''
   Вы - очень креативный автор, который пишет приятные лаконичные поздравления.
   '''
)

# "Write a short congratulation for {name} on the professional holiday {day} in a {style} style."
prompt_template = '''
Напиши короткое поздравление для {name} с профессиональным праздником {day} в {style} стиле.
'''

name = 'Александры'          # "for Alexandra" (genitive case)
day = 'День программиста'    # "Programmer's Day"
style = 'официальном'        # "official"

# Autogen calls this to build the message; `context` carries the kwargs
# passed to initiate_chat (name, day, style).
def prompt_message(sender, recipient, context):
  return prompt_template.format(
      name=context['name'],
      day=context['day'],
      style=context['style']
  )


# The agent sends the prompt to itself for a single turn of generation.
output = agent.initiate_chat(
   recipient=agent,
   silent=True,
   max_turns=1,
   message=prompt_message,
   name=name,
   day=day,
   style=style
)

print(output.chat_history[-1].get('content'))

In this example, the prompt_template is filled in dynamically with the passed arguments: name, day and style. It is used to create a personalized greeting.

We also sweep the model temperature in increments of 0.1 over the range 0.1 to 0.9 (inclusive) and, for each temperature, generate text for 100 iterations with the following message:

“Write a short congratulations to {name} on their professional holiday {day} in {style} style.”

At the same time, we can check how often the model invents dates for the specified holiday and how well it understands the details of the request: the recipient's gender and the text style we need.
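A sketch of how this sweep might be automated, reusing llm_config, prompt_message and the template variables defined above (the model tags come from the list earlier; error handling omitted):

import copy

models = [
   'mistral-nemo:12b-instruct-2407-q4_0',
   'qwen2',
   'hermes3',
   # ... the rest of the sixteen models
]
temperatures = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1 .. 0.9
results = {}

for model_name in models:
   for t in temperatures:
       config = copy.deepcopy(llm_config)
       config.update(model=model_name, temperature=t)
       sweep_agent = AssistantAgent(
           name='Агент',
           llm_config=config,
           system_message='''
           Вы - очень креативный автор, который пишет приятные лаконичные поздравления.
           '''
       )
       texts = []
       for _ in range(100):   # 100 generations per temperature
           out = sweep_agent.initiate_chat(
               recipient=sweep_agent,
               silent=True,
               max_turns=1,
               message=prompt_message,
               name=name, day=day, style=style
           )
           texts.append(out.chat_history[-1].get('content'))
       results[(model_name, t)] = texts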

Limitations of such a test

In this article I describe the simplest possible task: it can be done "on your knees" and in a short time. For example, in system_message we only provide the context for writing the text, without extra manipulation in the spirit of "write only in Russian and remember: my life depends on it!"

Changing system_message can significantly affect the output and its style. For example, changing "You are a very creative writer" to "You are a formal writer" may produce a more restrained, official congratulation. And if you remove the word "laconic", the length of the generated text increases significantly.

Several options for system_message:

Using the mistral-nemo model at temperature t = 0.1 as an example, here are several variants of system_message that assign the model a role:

  1. Leave it unfilled, using the default:

DEFAULT_SYSTEM_MESSAGE = '''
You are a helpful AI assistant.
Solve tasks using your coding and language skills.
In the following cases, suggest python code (in a python coding block) or shell script (in a sh coding block) for the user to execute.
   1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, get the current date/time, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
...
'''
  2. Now we ask the model to write personalized congratulations only in Russian:

system_message=""'
Вы - русскоязычный автор, который пишет поздравления согласно инструкции.

%Инструкция:
- поздравление должно быть исключительно на русском языке
- поздравление должно быть персонализировано
'''
  3. Now we refine it a little, forbidding mistakes, emoticons and other extraneous symbols, and asking for shorter text overall:

system_message=""'
Вы - русскоязычный автор, который пишет поздравления согласно инструкции.

%Инструкция:
- поздравление должно быть исключительно на русском языке
- текст должен быть лаконичным и доносить основную мысль
- в тексте не должно быть ошибок, смайлов и прочих посторонних символов
'''
  4. Finally, we briefly describe the desired role, following the recommendation:

    “This is likely fine when the system prompt is short and sweet, but can lead to problems when the instructions are very long (eg, with the AssistantAgent)”

system_message=""'
Вы - очень креативный автор, который пишет приятные лаконичные поздравления.
'''

| system_message | Median characters in text | Total characters in text | Spelling errors | Extraneous characters | English language | Other languages |
|---|---|---|---|---|---|---|
| 1 | 314 | 34,510 | 0 | 574 | 10 | 0 |
| 2 | 290 | 28,960 | 0 | 468 | 8 | 0 |
| 3 | 172 | 18,046 | 0 | 0 | 0 | 0 |
| 4 | 203 | 20,426 | 0 | 0 | 0 | 0 |

A more detailed example of LLM testing is described in the article "Data Search, Improved Training and Judging: How to improve the quality of LLM fine-tuning with minimal resources." For a more serious evaluation of models, you can use benchmarks, LLM judges, and manual side-by-side (SBS) evaluation.

Evaluation of answers

When choosing a model for generating Russian text, it is important to consider the quality of the generated language and the model's ability to respect context and style requirements. Ideally, the model should be trained on Russian-language data and account for grammar rules, slang, and stylistic features.

Examples of generated text for each of the models:
  1. mistral-nemo: Dear Alexandra! On this day, when we honor those who write code and create the future, we would like to congratulate you on your professional holiday – Programmer's Day. We wish you creative success, new achievements and constant growth in your profession. Happy holiday!

  2. qwen2: Dear ALEXANDRA,\n\nCongratulations on your professional holiday – Programmer's Day! Your vision and skill in this difficult, but so important matter for us, inspires and delights. May every new day bring new ideas and success in your favorite business.\n\nWith best wishes,\n[Ваше имя]

  3. hermes3: \nHeartfelt congratulations on your professional holiday, Alexander! Your talent and diligence do not go unnoticed. Your codes are an example of mastery. May every new day bring success and inspiration to your creative work.

  4. t-lite: Dear Alexandra!\n\nI sincerely congratulate you on Programmer’s Day! Your tireless work, creativity and deep knowledge contribute to the development of modern technologies. I wish you new achievements, inspiration and recognition of your work.\n\nWith respect,\n[Ваше имя]

  5. llama 3.1: Congratulations! You are the best programmer, and we are the happiest users!

  6. saiga_llama3: Dear Alexandra, we congratulate you on your professional holiday – Programmer’s Day! Your work and contribution to the development of modern technologies is highly valued. We wish you good health, success in your work and new achievements!

  7. gemma2: Dear Alexandra! \n\nCongratulations on Programmer's Day! We wish you successful projects, logical code and inspiration in your work. \n\n\n

  8. aya: Dear Alexandra!\n\nOn this professional holiday, known as Programmer's Day, we would like to congratulate you on your outstanding contribution to the world of programming. Your skills and prowess in writing code inspire everyone around you. We wish you new achievements, creative ideas and successful implementation of all your endeavors.\nHappy Programmer's Day!

  9. phi3: \ntitle: The following is a)\t (\n\n;e and the first time-specifically, there are no. \nIn the \n\n### question\nThe storyline = 2014).\n\ n\n \nAlice in “Theory of the \n\n\nQuestion:**\n\ntitle: The following is a)\t \n\n\n; \n\nAnswer:\n\n* '' ' (\n

  10. qwen2.5: Dear Alexandra,\n\nOn this special day, Allow me to extend to you our warmest congratulations on the professional holiday of the Programmer.\n\nYour professionalism and talent are always recognized and valued in our team. May every day of work bring you satisfaction and success.\n\nBest wishes,\n[Ваша Фамилия/Компания]

  11. phi3.5: \n\nI's, anderer \n\n\n\n\n问ing Analy FA\n\nquestion: Based on the\n\n\n\n\n\n\n\n\n\n跟\n\n\n\n\n\n\n\n\n**Q:10\n\n\nBased on a) Human Resources israel \n\n\nThoughts of natural language modellier, which one asherildessay I am Anfang Inputt \n\n# \n\nuser \n\n\n\n\nQuestion \n\n—\n\n在 \n\nUser \n\n�cted \n\n\n \n\n\n\n\n\n\nproblem \n\n\n\n\n\n\n指宿 \n\nInstruction: AI :\n\n # Q \n\n\nQ: Human Resources and the documentary for a) Whatshot \n\n\ninstantiated byta \n\n\n\n蒙 Tips to beant \n\n\n\n\n*Außooma, asynthesis Question \n\n### Q \n\nSalesforce. I's answer \n\n\n\n\n指定 \n\n

  12. mistral-small: Dear Alexandra!\n\nI cordially congratulate you on your professional holiday – Programmer’s Day! Your skills and dedication to your craft deserve special mention. May every new day bring you success, and may every line of code open up new opportunities for you.\n\nWith respect,\n[Ваше имя]

  13. solar-pro: \nCongratulations to you, Alexandra, on Programmer’s Day! Your work in creating high-quality models greatly contributes to the development of our company. Thank you for your contribution and I wish you even greater success!

  14. nemotron-mini: “Dear Alexandra, I hope this message finds you well on your special day as a programmer! Wishing you all the best in your career and personal life.”

  15. glm4: Dear Alexandra!\nCongratulations on Programmer’s Day! Your mastery of code and ability to solve any problem at the highest level is admirable. We wish you new creative success, inspiration and professional growth in the world of digital technology. Happiness and good luck!\n\nWith respect,\n[Ваше имя/Организация]

  16. llama 3.2: “Today we celebrate not only the day of programmers, but also the day of talented people like ALEXANDRA, who make our world more innovative and interesting. Congratulations!”

In the generated greetings, I estimated the total number of errors. I checked spelling with language_tool_python, Yandex Speller for Python, and autocorrect. Using regular expressions, I estimated the number of English words in the congratulations and the presence of invented dates, and counted tab characters. Additionally, I used langid to count switches into other languages.
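Roughly how such counting might look (a sketch; the exact regular expressions and thresholds used in the article may differ):

import re
import langid
import language_tool_python

tool = language_tool_python.LanguageTool('ru-RU')

def count_issues(text: str) -> dict:
    return {
        # spelling/grammar issues flagged by LanguageTool
        'lt_errors': len(tool.check(text)),
        # latin-script words as a proxy for English insertions
        'english_words': len(re.findall(r'[a-zA-Z]{2,}', text)),
        # digits may indicate an invented date for the holiday
        'digits': len(re.findall(r'\d', text)),
        'tabs': text.count('\t'),
        # dominant language of the text as a whole
        'language': langid.classify(text)[0],
    }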

Test results

Volume of generated text

First of all, let's see how many characters each model generated across all temperatures, and also look at the median number of characters per congratulation:

| Model | Average total number of characters | Median number of characters |
|---|---|---|
| mistral-nemo | 22,424 | 224 |
| qwen 2 | 34,002 | 335 |
| hermes 3 | 25,527 | 227 |
| t-lite | 58,314 | 354 |
| llama 3.1 | 20,463 | 146 |
| saiga_llama 3 | 26,673 | 271 |
| gemma 2 | 14,866 | 145 |
| aya | 46,703 | 404 |
| phi3 | 21,883 | 153 |
| qwen 2.5 | 30,990 | 310 |
| phi3.5 | 48,284 | 331 |
| mistral-small | 29,406 | 296 |
| solar-pro preview | 17,805 | 175 |
| nemotron-mini | 25,540 | 248 |
| glm4 | 26,160 | 270 |
| llama 3.2 | 11,939 | 109 |

Just in case, let me remind you: we asked for a short congratulation.

In the answers from hermes3, t-lite and llama 3.1, the number of characters grows as the temperature increases:

Answers at different temperatures

For convenience and clarity, let’s combine temperature intervals:

  1. t = 0.1 – 0.3: high quality; low variability, but the result is predictable and retains its meaning.

  2. t = 0.4 – 0.6: average quality; variability is noticeably higher, but the original meaning gradually begins to fade into the background.

  3. t = 0.7 – 0.9: unbridled creativity. The result is sometimes surprising:

Examples:

Very dear Alexandra,\n\ncomplain, someone, if you can, – but your technological magic gives a boost to the industry, makes swings in the markets and hides behind bars. A programmer's day without you is like nights without light-breathing breath.\n\nOn this special day, praise to your soul-programmer, who tried to connect the real and virtual world, what happiness and worthy hard work.\n\nThe more productive you worked, what you contributed to a sea of ​​bytes and stones in the encoding grid, the more clearly it is shown in the wash and on the ground:\n\na matryoshka doll, from which the whole world longs for such an amazing and rich ball as was found inside,\n\na trap from which you managed to get out, having seen from there a big picture and feeling like a super hero soaring up like a flying superhero,\n\nand, finally, a labyrinth, which you offered not only to rent with you, but, suddenly, invited you to play in it, repeating the story of a beggar who found himself in the kingdom of the rich, and remembering not only yourself, but the whole world for this.\n\nWe wish you stunning success, incredible impressions and even more unclean love for code.\n\nWith best wishes and serious gratitude.\n\n\nDate: [День программиста]\n\n[Ваше имя]\n[Ваша должность]\n[Вашая компания]

Greetings to the talented and hardworking computer smart people all over the world!\n\nOn a good evening, I learned that our country is celebrating the 256th IT Workers Day!\nOn this day we especially celebrate programmers.\nWith their help, our world has become the wisest and most functional! My friend Alexandra is a member of this most privileged group, so I want to wish her success on her new professional holiday!\n\nYour friend.

Dear Alexandra,\n\nOn the eve of Programmer's Day, we congratulate you on your professional holiday!\n\nFor Alexandra简短的程序员节祝福语吗?如果Alexandra Alexandra Alexandra Alexandra向您致以诚挚的祝福与崇高的敬意!\n \n祝您节日愉快,工作顺利,生活幸福美满!\n公司名称(可选填)\n\n请告诉我是否需要进行任何调整或是否有其他需求

Now let's count the spelling errors for each model. A caveat up front: none of the three tools checks the correctness of declension or gender agreement. The graphs show the number of errors for each model: blue bar – language_tool, green – Yandex Speller, red – autocorrect:

Numbers and tabs

Let me remind you that the request named a specific holiday, and we did not ask the model to invent a new date for it:

Switching to foreign languages

Since we are working with open-source models not specifically tailored to Russian, we may run into the LLM switching from Russian to another language. Most often this is English, but it can also be another Slavic language, and in the case of the qwen models, Chinese.

Let's count these errors:

Brief conclusion for each model

This summary is based on a specific set of criteria for evaluating language models: the number of errors, unexpected use of numbers, redundant tabs, and language transitions. Results may vary depending on the specific task, requirements, and complexity of the prompts used.

Test leaders:

  • mistral-nemo produces the most stable and pleasing results in terms of meaning, spelling and absence of extraneous characters. The only drawback: occasionally Russian words are replaced with English ones.

  • mistral-small – excellent result, almost no complaints. Uses tabs appropriately and in moderation.

Acceptable result, but there are a relatively large number of errors:

  • llama 3.1 produces a stable result thanks to low variability. There are tabs, and the address to the recipient is often missing, which makes the congratulations feel too generic.

  • saiga_llama3 – a good, stable result; tab characters appear as the temperature rises.

  • glm4 — the result is worthy of the top three in this block. As the temperature rises, Chinese characters begin to appear, although quite rarely.

  • T-lite – generally good. There are tabs, and as the temperature increases, something like _ComCallableWrapper<|begin_of_text|> or _________\n\n__________ may appear.

  • gemma2 – generally concise and good, with stability on par with llama 3.1. As the temperature rises, the number of English insertions and tabs increases.

  • qwen 2 – good at creativity, but there are problems with addressing the recipient and with tabs. It can switch into Chinese.

  • aya – not that bad overall, but it can confuse the recipient's gender and use strange salutations like "Your Excellency." There are redundant tab characters.

More problematic models:

  • hermes3 — tongue-tied: few epithets, and in most cases it only uses the phrase "congratulations to you." There are problems with tabs and Chinese characters, though it does not confuse the form of address.

  • qwen 2.5 — roughly on the level of its predecessor, but with many random English words and Chinese characters.

  • llama 3.2 — at low temperatures the congratulations look good, though epithets and names are quite often written in English. Starting from t = 0.4, the semantic load of the generated congratulations is gradually lost.

  • solar-pro — you can get good results at low temperatures, but it has few advantages for generating Russian text. If lighter options are available, it is better not to use it.

  • nemotron-mini – the generated text looks good, but there is a nuance – it is predominantly in English, regardless of the temperature.

  • phi 3, phi 3.5 – lots of mistakes. Something resembling a congratulation appears only at a temperature of 0.1; in all other cases the output is English words mixed with markup.

In an LLM test, it is very important to define a quality metric up front. Without clearly established evaluation criteria, it is difficult to choose the best model objectively.

For our task – simple generation of Russian-language text – the test showed that Mistral-nemo or Mistral-small are the best choices for a prototype. If compute is limited, you can take llama 3.1, qwen 2, glm4, t-lite, saiga_llama3, aya or gemma 2, but you will need to double-check the results. The following conclusions can also be drawn:

  • With almost any model, you may need a typo-correction tool. Typos can slip out of any current open-source LLM up to ~70b parameters; more powerful models also make mistakes, but much less often. So you may need spell checkers or a second text-checking pass, for example with the same Autogen agent but with detailed proofreading instructions (see the sketch after this list).

  • There may be extra tab characters, and quite a lot of them.

  • When working with numbers, it is better to insert them directly into the text or use detailed prompts for this case.

  • Problems with agreement of grammatical forms cannot be ruled out: inversions, endings, cases, declension.
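One possible shape of such a second checking pass, mentioned in the first bullet above (a sketch assuming the llm_config from the test; the proofreading instruction is illustrative):

# "You are a proofreader. Fix typos and grammatical errors without
#  changing the meaning. Return only the corrected text."
proofreader = AssistantAgent(
   name='Корректор',
   llm_config=llm_config,
   system_message='''
   Вы - корректор. Исправьте опечатки и грамматические ошибки в тексте,
   не меняя его смысл. Верните только исправленный текст.
   '''
)

checked = proofreader.initiate_chat(
   recipient=proofreader,
   silent=True,
   max_turns=1,
   message=generated_text   # a greeting produced in the first pass
)
print(checked.chat_history[-1].get('content'))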

Conclusions like these can already be shown to colleagues and management so they can evaluate the generation quality and decide whether to go ahead with a prototype.

How can you achieve more complex and accurate results?

The key to solving specialized problems is fine-tuning the LLM. Models pre-trained on large text corpora have a wide range of knowledge and abilities, but that is not enough for narrow, specialized tasks. For example, models trained mostly on English texts may produce grammatically correct but semantically inappropriate or irrelevant Russian text. To work with Russian, they need to be additionally fine-tuned on a Russian-language dataset.

Detailed instructions and guides for each model will help you understand how to configure and optimize training for specific tasks, such as generating Russian text. They explain the model's peculiarities, fine-tuning techniques, and the parameters to tune for the best results. Here is a small selection of such instructions, where everything is described at the "select model, run all" level:

The following articles will help you dive into the issue of additional training of models:

Keep in mind that fine-tuning a model takes time and resources, so plan the process in advance and make sure you have enough computing power for effective training. Also make sure you have access to the data on which the model will be fine-tuned.

That's all from me, but I'm happy to answer your questions. Thanks for reading!
