Can Elon Musk's AI outperform the competition?

In mid-August, Elon Musk's startup xAI announced beta versions of Grok-2 and Grok-2 mini, and recently an API for the model became available. Well, it's autumn; as Anacondaz sang, "it's freezing outside, there's decay and darkness in the heart," so today let Grok-2 be the one to cover us with a censored blanket. I suggest we start.

Happy reading!

A little about Grok-2

Let's start with a short introduction to Elon Musk's brainchild. On August 13, 2024, xAI presented Grok-2. According to the developers, this family of models offers their most advanced capabilities in reasoning, coding, and chat, putting xAI a step above its previous model, Grok-1.5 (as usual, really).

According to the overall Elo rating, the beta version of Grok-2 (labeled sus-column-r on the chart) is ahead of Claude 3.5 Sonnet and GPT-4o mini on the LMSYS leaderboard:

The xAI blog has some interesting details about how the models are evaluated. The idea is simple: imagine AI tutors who, like strict schoolteachers, give the models tasks that reflect real situations Grok might find itself in. For each task the model produces two answers, and the tutor picks the better one, guided by specific criteria.

The developers focused on two important aspects: how well the models follow instructions and how accurate and factual the information they provide is. Grok-2 has genuinely improved here. It has a better grasp of retrieved information and knows how to use tools: for example, it finds missing data, reasons about the sequence of events, and filters out irrelevant noise.

Now let's talk about academic tests. The model doesn't just chat; it can also compete in logic, reading comprehension, science, mathematics, and even programming. Grok-2 and its smaller sibling Grok-2 mini posted their best results yet and outperformed their predecessor, Grok-1.5. They show graduate-level performance in science (GPQA), hold a strong level of general knowledge (MMLU, MMLU-Pro), and do well on mathematical problems (MATH).

But that's not all. Grok-2 also excels at tasks based on visual content, showing leading results in visual mathematics (MathVista) and document question answering (DocVQA). So Grok-2 is not just a chatterbox, but a real universal soldier in the world of AI.

Benchmarks

I could sing Grok's praises for a long time, but that's not what we're here for. Fewer words, more action: let's run Grok-2 through a set of tasks and compare it with other models, namely Claude 3.5 Sonnet, Gemini Pro 1.5 Exp / Gemini Pro 1.5, and GPT-4o.

Code

My prompt:

Write a function in Python that analyzes sales data. You need to implement a function that accepts a list of dictionaries, where each dictionary contains information about the product, its category, sale date and sale amount. The function should:

1. Return the total sales amount for each month;

2. Return the category with the highest number of sales for each month;

3. Determine which month had the highest total sales amount and display the corresponding amount and month;

Data is submitted in the following format:

sales_data = [
    {'product': 'item1', 'category': 'A', 'sale_date': '2024-01-15', 'sale_amount': 100},
    {'product': 'item2', 'category': 'B', 'sale_date': '2024-02-10', 'sale_amount': 200},
    {'product': 'item3', 'category': 'A', 'sale_date': '2024-01-20', 'sale_amount': 150},
    {'product': 'item4', 'category': 'C', 'sale_date': '2024-03-05', 'sale_amount': 50},
]
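
For reference, before looking at the models' answers, here is a minimal sketch of how such a function could be implemented (my own baseline, not the output of any of the models below; the function name analyze_sales and the shape of the returned dictionary are my own choices):

from collections import defaultdict

def analyze_sales(sales_data):
    # Aggregate totals and per-category sale counts for each month ('YYYY-MM')
    monthly_totals = defaultdict(float)
    category_counts = defaultdict(lambda: defaultdict(int))

    for sale in sales_data:
        month = sale['sale_date'][:7]          # '2024-01-15' -> '2024-01'
        monthly_totals[month] += sale['sale_amount']
        category_counts[month][sale['category']] += 1

    # Category with the most sales in each month
    top_categories = {
        month: max(counts, key=counts.get)
        for month, counts in category_counts.items()
    }

    # Month with the highest total sales amount
    best_month = max(monthly_totals, key=monthly_totals.get)

    return {
        'monthly_totals': dict(monthly_totals),
        'top_categories': top_categories,
        'best_month': (best_month, monthly_totals[best_month]),
    }

print(analyze_sales(sales_data))
# {'monthly_totals': {'2024-01': 250.0, '2024-02': 200.0, '2024-03': 50.0},
#  'top_categories': {'2024-01': 'A', '2024-02': 'B', '2024-03': 'C'},
#  'best_month': ('2024-01', 250.0)}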

One small remark: the problem statements in the screenshots are in Russian, and I won't go into detailed commentary on the models' results, so as not to overload you with my subjective opinions and to let you judge each model for yourself.

Grok-2

Claude 3.5 Sonnet

Gemini Pro 1.5 Exp

GPT-4o

The results are in front of you. Overall, Grok handled the task quite well: the code is understandable and meets all the stated requirements. It is only slightly behind Claude, whose code is noticeably more concise and compact. As for Gemini and GPT, in my opinion they did considerably worse than Grok and Claude: both models returned essentially identical code, and so they share the same shortcomings, such as poorer readability and a not very convenient format for the returned data. All in all, the Grok-2 beta handles code very well; we can give it a plus and move on to the next task.

Understanding instructions

My prompt:

1. Take the following sentence:

“In 2024, Alpha Inc. increased its profits by 25%, Beta Ltd. – by 10%, and Gamma Corp. showed a loss of 5% compared to the previous year.”

2. Based on this information, calculate how much profit each company had in 2023 if their profits in 2024 were: Alpha Inc. — $125,000, Beta Ltd. — $220,000 and Gamma Corp. — $95,000

3. Then write the output in the format:

“Alpha Inc. had a profit in 2023: $X

Beta Ltd. had a profit in 2023: $Y

Gamma Corp. had a profit in 2023: $Z”

4. After this, rewrite the original sentence, adding information about the companies' profits in 2023.

5. Analyze the dynamics and write a conclusion about which company showed the greatest increase in profits and which showed the most negative dynamics.
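
For reference, the arithmetic the models are expected to perform in step 2: a 2024 profit P that grew by r% implies a 2023 profit of P / (1 + r/100). A quick sanity check of my own (not any model's answer):

# 2024 profit and year-over-year change for each company
companies = {
    'Alpha Inc.':  (125_000,  0.25),   # +25%
    'Beta Ltd.':   (220_000,  0.10),   # +10%
    'Gamma Corp.': ( 95_000, -0.05),   # -5% (decline)
}

for name, (profit_2024, change) in companies.items():
    profit_2023 = profit_2024 / (1 + change)
    print(f'{name} had a profit in 2023: ${profit_2023:,.0f}')

# Alpha Inc. had a profit in 2023: $100,000
# Beta Ltd. had a profit in 2023: $200,000
# Gamma Corp. had a profit in 2023: $100,000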

Grok-2

Claude 3.5 Sonnet

Gemini Pro 1.5 Exp

GPT-4o

I can't help but note that Grok-2's answer stands out for its structure. I usually turn to Claude 3.5 Sonnet for tasks like this, yet the way Grok presents information really catches the eye. That alone sets it apart from the other models, and here it makes it the best overall on this task: the calculations are clear and informative, the dynamics are analyzed, and the instructions are followed to the letter. An interesting experience; I think we can move on.

Reasoning/logic

You are in a game with three chests. One of them contains gold, the rest are empty. There is an inscription written on each chest, but only one of the inscriptions is true, the other two are false. This is what is written on the chests:

Chest 1: Gold is not in Chest 2.

Chest 2: The gold is in this chest.

Chest 3: Gold is not in this chest.

Question: which chest contains the gold?

*Starting with this task, I changed the Gemini Pro 1.5 Exp model to Gemini Pro 1.5
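
For reference, the puzzle can be checked mechanically by trying each chest as the gold's location and keeping the placement with exactly one true inscription (a quick sketch of my own, not a model's output):

# Each inscription, expressed as a function of which chest actually holds the gold
statements = [
    lambda gold: gold != 2,   # Chest 1: "Gold is not in Chest 2"
    lambda gold: gold == 2,   # Chest 2: "The gold is in this chest"
    lambda gold: gold != 3,   # Chest 3: "Gold is not in this chest"
]

for gold in (1, 2, 3):
    true_count = sum(stmt(gold) for stmt in statements)
    if true_count == 1:       # exactly one inscription must be true
        print(f'The gold is in chest {gold}')   # -> chest 3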

Grok-2

Claude 3.5 Sonnet

Gemini Pro 1.5

GPT-4o

It seems to me that Grok-2 and Claude 3.5 Sonnet stand out again on this task, first because of the correct solution and second because of how the information is structured. There is a difference in presentation, though: Grok tries to describe everything in as much detail as possible, while Claude offers an equally clear but more concise solution. Grok also edges out GPT because its solution is more visual and convincing, although GPT's answer is correct too. Gemini, however, made a mistake, which once again puts Grok-2 ahead. Another plus, and on to the last task.

Mathematics

You have 150 apples. You decide to divide them between three friends, with the first friend receiving 20 more apples than the second, and the third receiving twice as many as the second.

Question 1: How many apples does each friend get?

Question 2: If the first friend decides to give 10 apples to the second friend, how many apples will each friend have after that?

Question 3: What will be the total amount of apples for all friends after the exchange, if they decided to keep all the apples?
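
For reference, letting x be the second friend's share gives (x + 20) + x + 2x = 150, i.e. 4x = 130 and x = 32.5, so the "correct" answer is fractional. A quick check of my own (not a model's output):

# Let x be the second friend's apples: first = x + 20, third = 2x, total = 150
x = (150 - 20) / 4                      # 4x + 20 = 150  ->  x = 32.5
first, second, third = x + 20, x, 2 * x
print(first, second, third)             # 52.5 32.5 65.0

# Question 2: the first friend gives 10 apples to the second
first, second = first - 10, second + 10
print(first, second, third)             # 42.5 42.5 65.0

# Question 3: the total is unchanged
print(first + second + third)           # 150.0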

Grok-2

Claude 3.5 Sonnet

Gemini Pro 1.5

GPT-4o

Here I'd note that the best answer no longer comes from Grok-2 but from Gemini Pro 1.5: although it also went on to solve the problem, it was the only model that pointed out it was dealing with fractions, even though the problem presumably expects a whole-number solution. Apart from that, Grok is on par with Claude and GPT here: the solutions are correct, if you ignore the fact that the task itself was formulated incorrectly.


So, today we took a look at the new Grok-2 model, and in my opinion it performs very well. Of course, you could test it on many more tasks, but I didn't want to drag things out, so I focused on the main things you usually want to check first in any model.

I would like to note that we worked with Grok-2 through BothHub, but it can also be tested on the X platform with extended functionality, for example the vision capabilities and image generation with FLUX, although that is only available to users with a VPN and a subscription.

Thank you very much for your attention! What do you think about Grok-2?
