How I Trained a Model That Understands Russian Better Than GPT-3.5 Turbo
In this article, I will show how I trained a model that outperforms GPT-3.5 Turbo on the Russian-language part of MT-Bench. I will also walk through a new configuration for training on two GPUs in parallel using Accelerate and DeepSpeed.
Of particular interest is my training dataset, which is derived from a subset of the multilingual prompts in lightblue/tagengo-gpt4 in Russian, English, and Chinese: 10,000 examples in total, with answers generated by GPT-4o. This is 8 times smaller than the original Tagengo set, yet the latest Suzume, which was trained on Tagengo, only slightly outperforms my model on ru_mt_bench according to the benchmarks, and on the English-language benchmark it is even inferior to mine. In other words, I saved several times over on GPU costs thanks to the higher quality of the data obtained with GPT-4o.
I used a script to generate answers for the given prompts. To build the Russian-language sample, I modified part of the script to select all Russian prompts from Tagengo (8K examples), since the main focus of training was the Russian language.
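The sketch below illustrates the idea: filter the Russian prompts out of lightblue/tagengo-gpt4 and regenerate the answers with GPT-4o. It is a minimal sketch, not my exact script, and the column names (language, conversations) are assumptions about the dataset schema.

```python
# Minimal sketch of the dataset preparation step (not the exact script used).
# Assumes tagengo-gpt4 exposes a "language" column and ShareGPT-style "conversations".
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

tagengo = load_dataset("lightblue/tagengo-gpt4", split="train")
russian = tagengo.filter(lambda row: row["language"] == "Russian")

def regenerate(row):
    # The first turn of the conversation is the user prompt
    prompt = row["conversations"][0]["value"]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Replace the original answer with the GPT-4o one
    row["conversations"][1]["value"] = response.choices[0].message.content
    return row

russian_gpt4o = russian.map(regenerate)
russian_gpt4o.push_to_hub("ruslandev/tagengo-rus-gpt-4o")
```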
As a result, I received a dataset ruslandev/tagengo-rus-gpt-4o and began training.
For this I created a virtual machine with an NVIDIA H100, using the immers.cloud service. To get the best instruction-following results (which is what MT-Bench evaluates), I took meta-llama/Meta-Llama-3-8B-Instruct as the initial model. This is the model that Suzume, which has a high MT-Bench score, was trained on. Previous experiments had shown that the base Llama-3 8B, and especially its four-bit QLoRA version, unsloth/llama-3-8b-bnb-4bit, lags noticeably behind on benchmark scores.
This time I trained in parallel on two GPUs, so I switched my virtual machine to a new configuration with two NVIDIA A100s.
I used the axolotl tool, which allows you to quickly configure and launch a training session.
My axolotl config is here.
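For reference, here is a minimal sketch of what such a config can look like: full fine-tuning of Llama-3-8B-Instruct on a ShareGPT-style dataset with one of the DeepSpeed ZeRO configs bundled with axolotl. The hyperparameters are illustrative, not the exact values I used.

```yaml
# Illustrative axolotl config sketch, not the exact config used
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

datasets:
  - path: ruslandev/tagengo-rus-gpt-4o
    type: sharegpt   # prompt-template settings omitted for brevity

dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/llama-3-8b-gpt-4o-ru

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 1e-5

bf16: auto
gradient_checkpointing: true
flash_attention: true
logging_steps: 1

# One of the DeepSpeed configs shipped with axolotl; this is what shards
# the training run across the two GPUs launched by accelerate.
deepspeed: deepspeed_configs/zero2.json
```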
After installing axolotl, which is described in the documentation, all that remains is to launch training with the command:
```bash
accelerate launch -m axolotl.cli.train config.yaml
```
Accelerate is a Hugging Face library for distributed training.
axolotl launched two parallel processes, one per GPU, each holding its shard of the model. Training for one epoch took about an hour; the final train loss was 0.8.
The result exceeded my expectations: third place on mt_bench:
| model | score |
|---|---|
| gpt-3.5-turbo | 8.25625 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 | 8.22500 |
| ruslandev/llama-3-8b-gpt-4o-ru1.0 | 8.01250 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-full | 7.97500 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half | 7.97500 |
| meta-llama/Meta-Llama-3-8B-Instruct | 7.97500 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top75 | 7.93750 |
| Nexusflow/Starling-LM-7B-beta | 7.92500 |
| lightblue/llama-3-8B-multilingual-orpo-base-half-borda | 7.84375 |
| lightblue/suzume-llama-3-8B-multilingual-aya | 7.83125 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-naive-full | 7.78750 |
| lightblue/suzume-llama-3-8B-multilingual | 7.73125 |
| ruslandev/llama-3-8b-gpt-4o | 6.61875 |
My model surpassed llama-3-8b-instruct and most of the Suzume versions, except for the strongest one. And this is on the English-language benchmark.
Now, the ru_mt_bench results:
| model | score |
|---|---|
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half | 8.94 |
| lightblue/llama-3-8B-multilingual-orpo-base-half-borda | 8.86 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 | 8.84 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top75 | 8.46 |
| lightblue/suzume-llama-3-8B-multilingual | 8.36 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-naive-full | 8.32 |
| lightblue/suzume-llama-3-8B-multilingual-orpo-borda-full | 8.20 |
| ruslandev/llama-3-8b-gpt-4o-ru1.0 | 8.12 |
| Nexusflow/Starling-LM-7B-beta | 8.06 |
| lightblue/suzume-llama-3-8B-multilingual-aya | 8.00 |
| gpt-3.5-turbo | 7.94 |
| ruslandev/llama-3-8b-gpt-4o | 7.36 |
My model scored 8.12, just shy of the Suzume models and ahead of gpt-3.5-turbo, which scored 7.94.
This is a very promising result, and there are several conclusions to be drawn. First, my dataset is eight times smaller than Tagengo's, meaning training was much cheaper than Suzume's – just two GPU hours.
I did not expand the English portion of my dataset; it contains only a thousand English examples, yet the English MT-Bench surprisingly showed an average score of about 8. This suggests that adding high-quality multilingual data improves the overall quality of the model, not just its performance in a specific language. This effect was already shown in Peter Devine's paper, Tagengo: A Multilingual Chat Dataset.
I am very happy that I was able to see this idea implemented in practice. My dataset, model weights, and GGUF files are published on my Hugging Face account.
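As a quick usage example, the model can be loaded directly with the transformers library. This is a minimal sketch; the prompt is just an illustration.

```python
# Minimal inference sketch for the published model (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ruslandev/llama-3-8b-gpt-4o-ru1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama-3 chat template, single user turn in Russian
messages = [{"role": "user", "content": "Объясни в двух предложениях, что такое MT-Bench."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```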