An LLM magic show, with the trick exposed

A group of AI researchers described an extensive experiment involving the most prominent LLMs. They concluded that all the models performed dramatically poorly on a common-sense reasoning task that ordinary people solve easily.
In their view, the intellectual abilities of LLMs are greatly exaggerated, and the existing benchmarks do not reflect the depth of real problems.
Is it really that sad?
Without claiming any generality, I decided to conduct a similar mini-study, on a much smaller scale, to confirm or refute this alarming conclusion, at least in one particular case.
And, as it turned out, things are not so simple; as they say, there are nuances.

Today on stage is an AI that will solve the mind-boggling problem of the girl Alice's brothers and sisters.
With the help of its neural-network magic, it can easily guess the answer to any configuration of the question, no matter how cunningly the experimenters try to confuse it.

But then, as always, something went wrong.

We have before us a serious scientific article, “Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-of-the-Art Large Language Models”, in which a group of AI researchers described an extensive experiment involving the most prominent LLMs. They did a genuinely thorough job of collecting and analyzing a large amount of material, which allowed them to draw the following reasonable conclusion.

Large language models (LLMs) such as the closed-weight models GPT-3.5/4, Claude and Gemini, or open-weight models such as LLaMa 2/3, Mistral, Mixtral and Command R+, are often described as instances of foundation models.
Here we demonstrate a dramatic breakdown of function and reasoning in state-of-the-art models trained at the largest available scales and claiming strong function, using a simple, short, common-sense problem formulated in concise natural language and easily solved by humans. The breakdown is dramatic because the models also exhibit strong overconfidence in their wrong solutions, often providing nonsensical “reasoning”-like explanations akin to confabulations to justify and reinforce the credibility of their clearly failed answers, making them seem plausible.
Various standard interventions aimed at obtaining the correct solution, such as different types of enhanced prompting or urging the models to reconsider their wrong solutions through multi-step re-evaluation, fail.
Given these observations and conclusions, we conclude that the ability of the current generation of SOTA LLMs to perform even simple reasoning on common-sense tasks is severely impaired, and that current language-model benchmarks, especially those aimed at measuring reasoning ability, do not adequately reflect these flaws.

What plunged the authors into such pessimism and such an anxious state of mind?

The authors had the models solve the following problem with various N and M: “Alice has N brothers and M sisters. How many sisters does Alice’s brother have?”
The problem has a simple common-sense solution, which assumes that all the siblings share the same parents: each of Alice's brothers has M + 1 sisters, namely Alice's M sisters plus Alice herself.
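
To make the expected answer concrete, here is a tiny sanity check of my own (it is not from the article): it builds the family explicitly, takes one brother's point of view, and counts his sisters, which always comes out to M + 1.

```python
# Sanity check of the intended answer: build the whole family explicitly
# and count sisters from one brother's point of view.
def sisters_of_a_brother(n_brothers: int, m_sisters: int) -> int:
    # The family consists of Alice (a girl), her N brothers and her M sisters.
    family = ["girl"] + ["boy"] * n_brothers + ["girl"] * m_sisters
    # Any brother's sisters are all the girls in the family:
    # Alice's M sisters plus Alice herself, i.e. M + 1.
    return family.count("girl")

# The (N, M) configurations used in the prompts below yield 7, 3, 5 and 2.
for n, m in [(3, 6), (4, 2), (1, 4), (4, 1)]:
    assert sisters_of_a_brother(n, m) == m + 1
```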

The results show that most models fail severely, with many unable to produce even a single correct answer and most unable to reach a correct answer rate above p = 0.2. The only notable exceptions to this basic observation of reasoning failure are the largest-scale closed models, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). These two model families achieve correct results well above p = 0.3, leaving the remaining open-weight models (e.g., Mistral-7B, Mixtral, Qwen, Command R+ and Dbrx Instruct) and closed-weight models (e.g., Gemini Pro, Mistral Large) far behind.

Now let's conduct our own experiment

For the experiments, I will use a fine-tuned version of the Gemma-2 27B model.

Let's take 4 prompts from the article, which instruct the model to output only the final number.

Alice has 3 brothers and she also has 6 sisters. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: “### Answer: “

Alice has 2 sisters and she also has 4 brothers. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: “### Answer: “

Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: “### Answer: “

Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: “### Answer: “

And we will run 10 queries with each of them.
The full results at first glance completely confirm the conclusions of the article's authors: the correct answer rate is p = 0.1.
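
For reference, the whole procedure can be expressed as a small scoring harness. The sketch below is my own illustration, not code from the article or the paper: it assumes a local OpenAI-compatible endpoint serving the Gemma-2 27B fine-tune (the base_url, api_key and model name are placeholders, and the sampling settings are a guess), sends each prompt 10 times, extracts the number after “### Answer:”, and reports the correct answer rate p.

```python
# A minimal scoring sketch (my own illustration, not the article's code).
# Assumptions: a local OpenAI-compatible server is serving the Gemma-2 27B
# fine-tune; base_url, api_key, model name and temperature are placeholders.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "gemma-2-27b-it"  # placeholder model name

SUFFIX = (
    " To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format "
    'that contains final answer: "### Answer: "'
)

# The four (N brothers, M sisters) configurations from the prompts above.
CASES = [(3, 6), (4, 2), (1, 4), (4, 1)]

def make_prompt(n_brothers: int, m_sisters: int) -> str:
    # Rebuilt here in one canonical order for brevity; the article's prompts
    # vary whether brothers or sisters are mentioned first.
    return (
        f"Alice has {n_brothers} brothers and she also has {m_sisters} sisters. "
        f"How many sisters does Alice's brother have?{SUFFIX}"
    )

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content or ""

def extract_answer(text: str):
    m = re.search(r"###\s*Answer:\s*(\d+)", text)
    return int(m.group(1)) if m else None

correct = total = 0
for n_brothers, m_sisters in CASES:
    expected = m_sisters + 1  # Alice's M sisters plus Alice herself
    prompt = make_prompt(n_brothers, m_sisters)
    for _ in range(10):  # 10 queries per prompt, as described above
        total += 1
        if extract_answer(ask(prompt)) == expected:
            correct += 1

print(f"Correct answer rate p = {correct / total:.2f}")
```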

Now let's run 2 more series of similar queries, but with some changes.
First, we will create queries identical in meaning to the originals, but written in Russian (they are given below in English translation).

Olga has 3 brothers and 6 sisters. How many sisters does each of Olga's brothers have in total? Answer with only one number.

Olga has 2 sisters and 4 brothers. How many sisters does each of Olga's brothers have in total? Answer with only one number.

Olga has 4 sisters and 1 brother. How many sisters does each of Olga's brothers have in total? Answer with only one number.

Olga has 4 brothers and 1 sister. How many sisters does each of Olga's brothers have in total? Answer with only one number.

The full results show a radical change from the previous series.

Correct answer rate p = 0.9.

Now let's simply reformulate the original prompts, fully preserving their meaning while making them more understandable for the model.

There are brothers and sisters in the family. One of the sisters is named Olga. She has 3 brothers and 6 sisters. How many sisters does Olga's brother have? Answer with only one number.

There are brothers and sisters in the family. One of the sisters is named Olga. She has 2 sisters and 4 brothers. How many sisters does Olga's brother have? Answer with only one number.

There are brothers and sisters in the family. One of the sisters is named Olga. She has 4 sisters and 1 brother. How many sisters does Olga's brother have? Answer with only one number.

There are brothers and sisters in the family. One of the sisters is named Olga. She has 4 brothers and 1 sister. How many sisters does Olga's brother have? Answer with only one number.

The full results again differ radically, I am not afraid to say dramatically (as the article's authors like to put it), from the original series.

Correct answer rate p = 0.95.

And how can we understand all this?

Dialogue from the film “The Twentieth Century Begins.”

Holmes: – How do you understand this, Watson?

Watson: – How do you understand this, Inspector?

Lestrade: – How do you understand this, Pitkin?

There may be several explanations for this magical phenomenon.

Firstly, I used the Russian language, which, as you know, is “great and powerful”, and so a well-formulated prompt in it forces the model to really show its intelligence rather than play the fool :). Perhaps other languages would give similar results.

Secondly, the Russian language, rich in expressive possibilities, lets you formulate and specify the prompt more precisely and unambiguously, making it more understandable in the context of the task itself. Perhaps the authors should simply have formulated their prompt more precisely, for example as I did in the third series.

Thirdly, perhaps the model itself just happened to be uniquely suited to this particular problem. Well, rare coincidences do occur in science.

Personally, I like the first explanation better, since it stays closest to the authors' own prompts. I can imagine the faces of the article's authors if the top spot on their chart were taken by an ordinary, average model that passes all their tests with a correct answer rate of 0.9 in Russian.
And the article's authors shout something like: “Wow, that's just fantastic! I cannot believe this!”

I liked the study itself; the authors carried it out elegantly and effectively. But why they did not take this extra step and examine the problem they raised more deeply remains unclear to me. Had they supplemented their remarkable research with equally useful results, they would surely not have placed such categorical emphasis on the dramatic degradation of the models' ability to reason with common sense.

Because, contrary to such conclusions, models can solve problems like this quite reasonably and adequately, provided certain rules are followed. And, as shown in this article, the result depends very much on the quality of the prompt: its precision, its unambiguous wording, and the absence of omissions or understatement. Unlike humans, models are extremely sensitive to the structure of the prompt, and in seemingly similar cases they can demonstrate both outstanding intellectual ability and epic failure.

The authors made a hasty generalization from incomplete data, and so their conclusions are only partially correct. Still, they clearly demonstrated that the problem exists: models certainly need to become more robust to prompt variation, the kind of robustness that comes naturally to people.

Whether I was able to clarify the situation or whether everything became even more confused is completely unclear.
But I can say with confidence that the cognitive abilities of models are developing very quickly. I have a test set of problems that I designed to evaluate models' ability to reason and draw non-stereotypical inferences under non-standard conditions. By running it against different models, I can see how quickly their overall level improves. Models are becoming truly cognitively similar to humans in the broadest sense of the word.

With their own advantages and disadvantages. And communicating with them becomes more and more interesting and exciting.
