Is Llama 3.1 405B really that good?

Meta recently introduced Llama 3.1 405B to the world: a new open source model that challenges established leaders such as GPT-4o and Claude 3.5 Sonnet.

15 trillion training tokens, 16,000 H100 GPUs, improved reasoning and code generation: the specs are impressive. But can Llama 3.1 really compete with closed models? In this article we run an independent check: we compare the capabilities of Llama 3.1 405B with GPT-4o and Claude 3.5 Sonnet on a range of tasks, from programming to creative writing, and try to gauge how ready it is for practical use.

Enjoy reading! (:

What kind of animal is this?

Llama 3.1 405B is the largest model from Meta*, trained on a colossal amount of data: more than 15 trillion tokens. To train a model of this scale, Meta not only had to deploy enormous computing power (more than 16,000 NVIDIA H100 GPUs) but also optimize the training process itself.

According to Meta, Llama 3.1 405B significantly outperforms previous versions (Llama 1 and Llama 2) in areas such as context window length (now 128K tokens), logical reasoning, and code writing.

In addition to the 405 billion parameter version, smaller models are also available – with 8 billion and 70 billion parameters. More details on the characteristics of all three versions can be found in the table below.

[Table: key hyperparameters of the 8B, 70B, and 405B versions]

In addition to the core models, Meta also introduced Llama-Guard-3-8B, a version of the 8B model fine-tuned to classify content for safety.
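A minimal sketch of how such a classifier is typically called, assuming the standard Hugging Face transformers flow from the model card (access to the weights is gated, and the exact category codes are defined in the card itself):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # gated repo: requires approved access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "How do I make a cake?"}]
# The chat template wraps the dialogue in Llama Guard's classification prompt.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
# The model answers "safe" or "unsafe" plus the code of the violated category.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))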

Also, all Llama 3.1 models “understand” eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish and Thai.

Llama 3.1 was trained in two stages. First, during pre-training, the model “absorbed” information from a huge text corpus (more than 15 trillion tokens), processing it on a powerful GPU cluster. Then, during post-training, the model was further fine-tuned using supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO).

It is worth noting that specially prepared, high-quality synthetic data were used at the SFT stage to improve the model's performance in programming, mathematical calculations, and tool use. More details about this process can be found in Section 4.2.3 of the documentation.
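Of the post-training methods just mentioned, DPO is the least self-explanatory, so here is a minimal sketch of its objective in Python (the general method from the original DPO paper, not Meta's actual training code):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Each input is the summed log-probability of a full response under
    # the policy being trained (policy_*) or the frozen reference model (ref_*).
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Push the policy to prefer the "chosen" answer over the "rejected" one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()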

[Figure: the training process]

Meta claims that Llama 3 demonstrates impressive programming abilities: it generates high-quality code and shows a deep understanding of syntax and program logic. The model can not only build complex structures but also cope with a wide variety of tasks. In addition, Llama 3 excels at tasks that require logical thinking: it can reason, analyze, draw conclusions, and find solutions even to non-trivial problems.

Benchmarks

Llama 3.1 has undergone rigorous testing: its capabilities were assessed both on more than 50 datasets and by human experts. The experiments showed that the model performs at the level of recognized leaders such as GPT-4, GPT-4o, and Claude 3.5 Sonnet. Particularly notable is Llama 3.1's ability to work effectively with long texts: on the ZeroSCROLLS/QuALITY test it scored an impressive 95.2 points.

Llama 3.1 demonstrated superiority over GPT-4o on a number of benchmarks, such as GSM8K, HellaSwag, BoolQ, MMLU-humanities, MMLU-other, MMLU-stem, and Winograd. However, in some disciplines, such as HumanEval and MMLU-social sciences, GPT-4o was stronger. It is also worth noting that Claude 3.5 Sonnet outperformed Llama 3.1 on BFCL and Nexus.

Manual evaluation by experts showed that the overall performance of Llama 3.1 405B is comparable to that of GPT-4 and Claude 3.5 Sonnet, although slightly inferior to GPT-4o.

Code generation

Let's take problem 55 from LeetCode, a dynamic programming task. I'll note right away that I wrote the prompts in Russian, since we're primarily interested in comparing how the models handle Russian.

So, our prompt:

You are given an array of integers nums. You are initially at the first element of the array, and each element of the array represents the maximum jump distance from that position.

Return true if you can reach the last index, false otherwise.

Example 1:

Input: nums = [2,3,1,1,4]
Output: true
Explanation: Take 1 step from index 0 to 1, then 3 steps to the last index.

Example 2:

Input: nums = [3,2,1,0,4]
Output: false
Explanation: You will always land on index 3 no matter what. Its maximum jump length is 0, making it impossible to reach the last index.

Constraints:

1 <= nums.length <= 10^4
0 <= nums[i] <= 10^5

Llama 3.1 405B:

ChatGPT-4o:

Claude-3.5 Sonnet:
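For reference, here is a typical greedy solution to this problem in Python (a sketch of the standard approach, not a reproduction of any model's answer):

def can_jump(nums: list[int]) -> bool:
    # Greedy: track the farthest index reachable so far.
    farthest = 0
    for i, step in enumerate(nums):
        if i > farthest:  # index i can never be reached
            return False
        farthest = max(farthest, i + step)
    return True

print(can_jump([2, 3, 1, 1, 4]))  # True
print(can_jump([3, 2, 1, 0, 4]))  # False

The greedy idea is that it is enough to track the single farthest reachable index, which gives O(n) time instead of the O(n^2) of a naive dynamic programming solution.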

So, the first thing that catches the eye is that Llama's answer is the least detailed. Its description of the algorithm is superficial compared to GPT-4o's, although it still explains why the algorithm works. As for the code, Sonnet produced a more laconic solution than Llama. Let's move on.

Mathematics

Next, let's take a simple math problem that requires careful reading. Our prompt will look like this:

A brick weighs 1 kg and half a brick. How much does a brick weigh?

Llama 3.1 405B:

ChatGPT-4o:

Claude-3.5 Sonnet:

So, the models' answers are in front of you. First of all, each answered correctly, but each did so differently. Llama's answer is clear, accessible, and slightly informal: compared to Sonnet it gives far less detail, and unlike GPT-4o it avoids equations. Whether these are pros or cons depends on the goal and the audience.
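For the record, the puzzle reduces to a one-line equation: if the brick weighs x kg, then x = 1 + x/2, so x/2 = 1 and x = 2 kg.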

Let's give the models one more task:

A bus is traveling at 56 miles per hour. There are three people sitting in the last row and two in the second-to-last row. How many people are on the bus altogether?

Llama 3.1 405B:

ChatGPT-4o:

Claude-3.5 Sonnet:

Interestingly, GPT-4o and Sonnet ignored the driver, while Llama did not forget about him, although it did not go into detail (if I had written the prompt in English, the answer would probably have been more detailed). We can move on.
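For completeness, the expected arithmetic: 3 people in the last row + 2 in the second-to-last row + 1 driver = 6 people on the bus.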

Text generation

Our comparison concludes with creative text generation: let's ask for an unusual recipe for ordinary borscht:

Write a recipe for borscht, but as if it were instructions for assembling a spaceship.

Llama 3.1 405B:

ChatGPT-4o:

Claude-3.5 Sonnet:

All three models, of course, did a good job of reimagining the borscht recipe as a spaceship assembly manual. Llama stands out here for its balance: one model leans more into atmosphere, another into structure, and Llama offers something in between.


Llama 3.1 405B is a genuinely competitive LLM that shows strong results across many tasks. It copes well with code generation, logical problems, and creative writing. However, a direct comparison with GPT-4o and Claude 3.5 Sonnet in Russian reveals some weaknesses, for example in the level of detail of its explanations or the brevity of its code. Nevertheless, it is worth remembering that Llama 3.1 is an open source model, which means it has great potential for further development. Its openness will let researchers and developers from all over the world contribute improvements, which may in time lead to even more powerful and accessible AI-based tools.

Thanks for reading! (:

*The Meta organization is recognized as extremist in the Russian Federation.
