Users test LLM abilities through games, particularly Minecraft and Pictionary

Most tests for evaluating AI models are not very informative: they often boil down to simple memorization of answers, or cover topics with little relevance to users. That's why some AI enthusiasts have turned to games as a way to evaluate models' problem-solving skills.

Paul Calcraft, an independent developer, has created an app in which AI models play a game of Pictionary: one model draws, and the others try to guess what is depicted.

Models guess the image: is it a duck or the sun?
Image source: Paul Calcraft

“I thought this could be an interesting way to evaluate a model’s abilities,” Calcraft shared. “So I decided to spend a rainy Saturday doing this project.”

Calcraft got the idea for the project after coming across the work of British programmer Simon Willison, who created a similar test by asking models to draw a vector image of a pelican riding a bicycle. Calcraft and Willison chose tasks like these precisely to force models to “think” beyond their training data.

“This is a test that cannot be passed simply by memorizing ready-made answers or patterns from the training data,” Calcraft explained.
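Calcraft hasn't published the app's internals here, but the basic loop is easy to picture. Below is a minimal sketch assuming the OpenAI Python client (openai>=1.0) with an API key in the environment; the model names, prompts, and the choice of SVG as the drawing format are illustrative assumptions, not Calcraft's actual implementation:

```python
# Minimal drawer/guesser loop: one model renders a secret word as SVG,
# another sees only the markup and has to guess the word.
# Model names and prompts are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

def draw(secret_word: str, model: str = "gpt-4o") -> str:
    """Ask the 'artist' model to render the secret word as SVG markup."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                f"Draw '{secret_word}' as a simple SVG image. "
                "Reply with the <svg>...</svg> markup only, no text labels."
            ),
        }],
    )
    return resp.choices[0].message.content

def guess(svg_markup: str, model: str = "gpt-4o-mini") -> str:
    """Show a different model only the markup and ask for a one-word guess."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "This SVG is a Pictionary drawing. What does it depict? "
                f"Answer with a single word.\n\n{svg_markup}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower()

secret = "duck"
svg = draw(secret)
print("guess:", guess(svg), "| answer:", secret)
```

Because the guesser never sees the secret word, the drawing itself has to carry the concept, which is what makes the test hard to pass through memorization alone.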

Games are becoming a new and more flexible way to test AI models. For example, 16-year-old developer Adonis Singh created mc-bench, a tool that evaluates a model's abilities by having it control a character in Minecraft and build various structures. “Minecraft tests the resourcefulness of models and gives them more freedom to act,” he said.
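The article doesn't describe mc-bench's actual harness, so what follows is a purely hypothetical illustration of the general pattern such a benchmark can use: ask a model for a structured build plan, then check the plan against a simulated voxel world. The JSON plan format, the allowed-block list, and the validator are all assumptions:

```python
# Hypothetical sketch of a Minecraft-style evaluation: a model emits a
# structured build plan, and we keep only the placements that are legal
# in a bounded voxel grid. Not mc-bench's real format or API.
import json

ALLOWED_BLOCKS = {"stone", "oak_planks", "glass", "torch"}

def validate_plan(plan_json: str, world_size: int = 32) -> list[tuple]:
    """Parse a model's build plan and keep only legal block placements."""
    placements = []
    for step in json.loads(plan_json):
        x, y, z, block = step["x"], step["y"], step["z"], step["block"]
        in_bounds = all(0 <= c < world_size for c in (x, y, z))
        if in_bounds and block in ALLOWED_BLOCKS:
            placements.append((x, y, z, block))
    return placements

# e.g. a model asked to "build a small watchtower" might return:
sample = (
    '[{"x": 0, "y": 0, "z": 0, "block": "stone"},'
    ' {"x": 0, "y": 1, "z": 0, "block": "stone"},'
    ' {"x": 0, "y": 2, "z": 0, "block": "torch"}]'
)
print(validate_plan(sample))
```

A plan that parses, stays in bounds, and uses only legal blocks can then be placed and judged, which gives an otherwise free-form task a checkable core.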

Using games to test AI is not new. Back in 1949, the mathematician Claude Shannon argued that games such as chess are a worthy test for intelligent systems. In recent years, DeepMind has built AI systems that play Pong and Breakout, and OpenAI has fielded systems that compete in Dota 2 matches.

Nowadays, AI enthusiasts have large language models (LLMs) play games to test their logical abilities. Different models, such as Gemini, Claude, and GPT-4, create “different impressions” when you interact with them, a phenomenon that is difficult to quantify. “LLMs are known for their sensitivity to prompt wording, their instability, and their unpredictability,” Calcraft added.

Please note the typo: there is no Claude 3.6 Sonnet model.
Image source: Adonis Singh

Games provide a visual and intuitive way to evaluate how AI performs tasks, says Matthew Guzdial, a researcher at the University of Alberta.

“Each test simplifies reality in its own way, focusing on certain types of problems, be it logic or communication,” he noted. “Games are simply another approach to evaluating AI decisions, which is why they are used alongside other tests and methods.”

There are clear similarities between Pictionary and generative adversarial networks (GANs), where one model creates an image and another evaluates it. Calcraft believes Pictionary can probe models' understanding of concepts such as shape, color, and prepositions (for instance, the difference between “in” and “on”). While it's not a rigorous test of reasoning, playing Pictionary well requires strategy and an understanding of clues, tasks that AI doesn't always handle easily.

“I like the almost adversarial nature of Pictionary, which is reminiscent of how GANs work, where one model draws and the other tries to guess,” Calcraft noted. “Here, the best artist is not the one who draws the most beautifully, but the one who most clearly conveys an idea that is understandable to other language models.”

Calcraft warns, however, that Pictionary is more of a “toy” test that doesn't solve practical problems. Still, he believes that spatial reasoning and multimodal skills are important elements of AI development, and that LLM Pictionary can be a small but meaningful step in that direction.

Singh also sees Minecraft as a useful tool for assessing LLMs' reasoning abilities. “The models I tested produced results consistent with my level of confidence in their ability to solve problems,” he said.

Image source: Adonis Singh

However, not all researchers support this approach. Mike Cook, a research fellow at Queen Mary University who specializes in AI, is skeptical about using Minecraft as a testing environment for AI.

“Part of the appeal of Minecraft is its similarity to the 'real world,'” Cook explained. “But fundamentally, the challenges in Minecraft aren't that different from those in other games like Fortnite or World of Warcraft. Minecraft just creates the illusion of everyday tasks like building or exploring.”

Cook also noted that even the most advanced AI gaming systems have difficulty adapting to new conditions and challenges that they have not yet encountered. For example, a model trained in Minecraft is unlikely to be as successful in the game Doom. “Minecraft has some features that are useful for AI testing, such as weak reward signals and a procedural world where tasks can be unpredictable,” he added. “But that doesn't make it any more 'real' than other games.”

Despite this, watching language models build castles and interact in games continues to generate genuine interest and enthusiasm.

What do you think about testing LLMs with games?
