Llama 3.1 and Mistral Large 2

Last month two interesting models were released – Llama 3.1, an improved version of Llama 3, and Mistral Large 2.

The most noticeable difference between Llama 3.1 and the previous models is the 405B version – 405 billion trainable parameters. It is the largest open language model to date, and the published metrics show performance on par with GPT-4. Tests were run both on general benchmarks, such as MMLU, and on specialized ones, such as code and math.

I found the improved multilingual capabilities of this model particularly interesting, since I have been experimenting with training LLMs on multilingual data for a long time; my latest model, ruslandev/llama-3-8b-gpt-4o-ru1.0, outperformed GPT-3.5 on the Russian version of the MT-Bench benchmark.

Llama 3.1 supports seven languages besides English – French, German, Hindi, Italian, Portuguese, Spanish and Thai. Russian is not on the list, as you can easily notice, but that does not mean the base model's training corpus contains no Russian examples. There are more than enough, which becomes obvious during fine-tuning. I have my own fine-tuning dataset, ruslandev/tagengo-rus-gpt-4o, which I generated with GPT-4o from the mostly Russian-language prompts of the Tagengo dataset.
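The dataset is on the Hugging Face Hub, so pulling it down to inspect before fine-tuning is a one-liner. A minimal sketch is below; the "conversations" column name and its ShareGPT-style layout are my assumptions about the schema, so check the first record against whatever your training script expects.

```python
# Minimal sketch: load the dataset from the Hugging Face Hub.
# The exact column layout (assumed here to be ShareGPT-style "conversations")
# should be verified on a real record before fine-tuning.
from datasets import load_dataset

dataset = load_dataset("ruslandev/tagengo-rus-gpt-4o", split="train")

# Inspect one record to confirm the actual schema.
print(dataset[0])
```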

Now for the disadvantages of Llama 3.1 – fine-tuning the 405B version is expensive, since even with 4-bit quantization you need to allocate about 200 GB of VRAM for the task. So I fine-tuned the 8B version on the above-mentioned dataset, renting two A100 GPUs from the cloud service immers.cloud. But I did not notice any particular advantage of version 3.1 over version 3. On the contrary, I ran into several problems: for example, after fine-tuning on my dataset, 3.1 tended to cut off generation without completing the answer. I never got to the bottom of the reason, but Llama 3 did not have this problem.
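To give a sense of that setup, here is a rough sketch of loading the 8B model in 4-bit for QLoRA-style fine-tuning with transformers, bitsandbytes and peft. The model ID and LoRA hyperparameters are illustrative, not the exact configuration I used; the comment also shows where the ~200 GB estimate for the 405B version comes from.

```python
# Rough sketch: 4-bit loading of Llama 3.1 8B plus LoRA adapters.
# Hyperparameters here are illustrative, not the exact training config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Back-of-the-envelope VRAM estimate for the 405B version:
# 405e9 parameters * 0.5 bytes (4-bit) ≈ 202 GB for the weights alone,
# before activations, optimizer state and LoRA adapters.

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```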

By the way, if you also find the 405B version too heavy to run on your hardware, take a look at Mistral Large 2, which came out almost simultaneously with Llama 3.1. This model has 123 billion parameters – roughly a third the size of Llama 3.1 405B. Here are some interesting benchmark results for comparing the two models.

Mistral defeats Llama on MT-Bench:

And also on code generation and math tasks:

At the same time, inference with Mistral Large 2 is obviously cheaper.

I haven't tried fine-tuning Mistral yet – in my opinion, Llama has more tooling for this, including the official llama-recipes scripts that support FSDP (Fully Sharded Data Parallel), an efficient approach to distributed fine-tuning in which not only the data (as in DDP, Distributed Data Parallel) but also the model parameters and gradients are sharded across several GPUs.
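To make the DDP/FSDP distinction concrete, here is a minimal sketch of the core PyTorch API that llama-recipes builds on – not the llama-recipes code itself. The model name and optimizer settings are placeholders; the point is only the wrapping step that shards parameters instead of replicating them.

```python
# Minimal FSDP sketch: unlike DDP, which keeps a full model replica on every
# GPU and only splits the data, FSDP also shards parameters, gradients and
# optimizer state across ranks. Launch with torchrun, one process per GPU.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = FSDP(model.cuda())  # parameters are sharded across all ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# The training loop then looks the same as with DDP:
# loss = model(**batch).loss; loss.backward(); optimizer.step()
```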

So the 8B versions of Llama 3 and 3.1, at the very least, remain excellent material for AI development: lightweight and high-performing.
