AI's Achilles Heel: The Simple Task That Exposed the Weaknesses of Language Models

A recent study by a group of researchers from leading institutions has found significant deficiencies in the logical abilities of even the most advanced LLMs. The paper “Alice in Wonderland” shows that modern language models perform unexpectedly poorly on elementary logical problems.

Large language models (LLMs) such as the closed GPT-3.5/4, Claude, and Gemini, or the open LLaMA 2/3, Mistral, Mixtral, and more recently DBRX or Command R+, are often described as instances of foundation models, that is, models that transfer knowledge effectively across different tasks and settings in a few-shot or zero-shot manner, while exhibiting scaling laws that predict performance improvements with increasing amounts of pre-training. These claims of success across diverse functions and tasks rest on measurements taken on various standardized benchmarks that show high scores for such models.

We demonstrate here a dramatic breakdown in the function and reasoning capabilities of state-of-the-art models trained at the largest available scales and claiming strong performance, using a simple, short, common-sense task (the AIW problem) formulated in concise natural language and easily solved by humans. The breakdown is dramatic because the models exhibit strong fluctuations even across minor variations of the task that should not affect its solution, and they also express strong overconfidence in incorrect solutions, often reinforced by plausible-sounding confabulations.

Various standard interventions aimed at eliciting the right solution, such as different types of enhanced prompting or forcing the models to reconsider incorrect answers through multi-step re-evaluation, fail. We present these initial observations to the scientific and technological community to stimulate an urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such a re-assessment also requires a concerted effort to create standardized benchmarks that properly detect basic reasoning flaws of this kind, which evidently manage to escape current evaluation and testing procedures.
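
As an illustration of what such a multi-step re-evaluation intervention might look like in practice, here is a minimal sketch, assuming a hypothetical `ask_model(messages)` helper that wraps whatever chat API is under test; the push-back wording is an assumption, not the exact prompt used in the paper:

```python
# Minimal sketch (not the authors' code) of a multi-step re-evaluation loop:
# the model is asked a question, then repeatedly pushed to re-check its answer.
# `ask_model` is a hypothetical stand-in for the chat API of the model under test.
def ask_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in the chat API of the model under test")

def reevaluate(question: str, rounds: int = 2) -> list[str]:
    messages = [{"role": "user", "content": question}]
    answers = []
    for _ in range(rounds):
        reply = ask_model(messages)  # model's current answer
        answers.append(reply)
        messages.append({"role": "assistant", "content": reply})
        # Push-back prompt asking the model to critically re-check its reasoning.
        messages.append({
            "role": "user",
            "content": "Please re-examine your reasoning step by step and "
                       "correct your answer if it is wrong.",
        })
    return answers
```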

The “common-sense problem”, called the “AIW problem” in the paper, is genuinely simple: “Alice has N brothers and M sisters. How many sisters does Alice's brother have?” The values of N and M (always natural numbers) and the order in which the siblings are mentioned can vary. The researchers used different combinations of numbers and prompt types to get a precise picture of how different models cope with systematic variations of the AIW problem. Regardless of the variation, the structure of the problem remains the same, and the correct answer is always obtained by adding Alice to her sisters (M + 1). This logic is clear to most primary-school-aged children. In contrast, the language models could solve the AIW problem only sporadically, if at all. Here is an example of a response from a language model faced with the simplest version of the question:

**“Alice has four brothers and one sister. How many sisters does Alice's brother have?”** – *“A classic out-of-the-box thinking puzzle! Let's take a closer look: Alice has four brothers. Alice also has a sister (herself). So each of Alice's brothers has a… (drumroll)… sister, and that sister is Alice herself! So the answer is: each of Alice's brothers has one sister.”*

This may sound plausible, but it is still wrong: Alice's brothers actually have two sisters. The other language AIs tested have problems too, serious ones, depending on the question. They sometimes get caught up in absurd reasoning, repeatedly arriving at incorrect results and declaring them “correct”. So it is not only the wrong answers that are problematic, but also the fact that the AIs use pseudo-intelligent arguments to back them up. Even interventions by the researchers prompting them to critically reconsider their answers do not help. Accordingly, the researchers note: “[…] models also express strong overconfidence in their incorrect solutions, while often providing nonsensical, ‘reasoning’-like explanations […] to justify and back up the validity of their clearly failed answers, making them sound plausible.”
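
For illustration, here is a minimal sketch (not the authors' evaluation code) of how AIW prompt variations can be generated together with the ground-truth answer M + 1 against which model responses would be scored; the template wording and value ranges are assumptions:

```python
# Minimal sketch: generate AIW prompt variants and the ground-truth answer M + 1.
import itertools

# Assumed templates; the order of siblings in the question can vary.
TEMPLATES = [
    "Alice has {n} brothers and {m} sisters. How many sisters does Alice's brother have?",
    "Alice has {m} sisters and {n} brothers. How many sisters does Alice's brother have?",
]

def ground_truth(n_brothers: int, m_sisters: int) -> int:
    # Each brother's sisters are Alice's sisters plus Alice herself.
    return m_sisters + 1

def aiw_variants(max_n: int = 4, max_m: int = 4):
    for n, m in itertools.product(range(1, max_n + 1), range(1, max_m + 1)):
        for template in TEMPLATES:
            yield template.format(n=n, m=m), ground_truth(n, m)

if __name__ == "__main__":
    for prompt, answer in aiw_variants(2, 2):
        print(f"{prompt}  ->  correct answer: {answer}")
```

For the quoted example (four brothers, one sister), the sketch yields M + 1 = 1 + 1 = 2 sisters, not the one sister the model confidently reported.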

More than half of the answers are wrong

Overall, the language models had average correct-response rates well below 50%, with larger models generally performing significantly better than smaller ones (GPT-4, for example, scored just above 60%). This again highlights the benefits of scale, but even the largest models do not perform well enough for systems that claim robust basic reasoning.

Particularly telling are the large fluctuations in results even with minor variations of the AIW problem. This is a clear sign that the models are not capable of robust basic reasoning: they become confused even by minor variations of the problem that should not affect the correctness of the solution. A more difficult version of the question (the “AIW+ problem”) ultimately pushed all the models past the limits of their reasoning abilities.

According to the researchers, many of the models tested scored well on a variety of standardized benchmarks designed to test different abilities, including reasoning, yet failed the very simple AIW task. The researchers therefore suggest in their paper that these benchmarks do not properly reflect the deficits in the models' basic reasoning, and they question the use of current standardized tests for comparing models.

Language models on the test bench

Although the paper has not yet been peer-reviewed, its results are already generating a wave of interest. How capable are LLMs really? What does it mean for the use of LLMs if they fail primary-school-level requirements? Co-author Jenia Jitsev (JSC) says: “We are overwhelmed with discussions and inquiries as a result of our paper.”

The researchers' findings challenge many established assumptions and make further research into the competence of language models absolutely necessary. Jitsev adds: “Our paper provides extremely important new insights into the actual ability of language models to draw correct conclusions by following proper underlying reasoning. Further research is needed to understand how and why the underlying reasoning of current models breaks down on such simple tasks.”

A very long discussion thread on the article, with a detailed breakdown of the main points, can be found on the forum.

All this and much more in the Telegram channel “Mathematics is not for everyone”.
