OpenAI o1 Results, Testing, and Notes on the New Model

Over the past 24 hours, we have gained access to the newly released OpenAI models o1-preview and o1-mini, which were specially trained to emulate reasoning. These models are given extra time to generate and refine reasoning tokens before producing a final answer.

Hundreds of people have asked how o1 fares on the ARC Prize. So we tested it using the same basic test system we used to evaluate Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5. Here are the results:

Is o1 a new paradigm for AGI? Will it scale? What explains the huge gap between o1's impressive performance on IOI, AIME, and many other benchmarks and its modest results on ARC-AGI?

We have a lot to talk about.

Chain of thought

o1 fully embraces the chain-of-thought (CoT) paradigm ("let's think step by step"), applying it both at training time and at test time.

Source: OpenAI

In practice, o1 makes significantly fewer errors when executing tasks whose sequence of intermediate steps is well represented in the synthetic CoT training data.
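As a rough illustration of the test-time half of this paradigm, here is a minimal sketch of "let's think step by step" prompting; `call_model` is a hypothetical stand-in for whatever chat-completion client you use, not part of o1's pipeline.

```python
# Minimal sketch of "let's think step by step" prompting at test time.
# `call_model` is a hypothetical stand-in for any chat-completion client.

def call_model(prompt: str) -> str:
    """Send the prompt to an LLM and return its text completion (placeholder)."""
    raise NotImplementedError("wire this up to your model provider")

def solve_with_cot(task_description: str) -> str:
    prompt = (
        f"{task_description}\n\n"
        "Let's think step by step. Write out your intermediate reasoning, "
        "then give the final answer on a line starting with 'Answer:'."
    )
    completion = call_model(prompt)
    # Keep only the final answer; the reasoning tokens are intermediate work.
    answers = [line for line in completion.splitlines() if line.startswith("Answer:")]
    return answers[-1].removeprefix("Answer:").strip() if answers else completion
```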

OpenAI says it created a new reinforcement learning (RL) algorithm and a highly data-efficient training process that leverages CoT.

The underlying source of o1 training is still a fixed set of pre-training data. But OpenAI can also generate tons of synthetic CoTs that emulate human thinking to further train the model using RL. The question remains, how does OpenAI choose which generated CoTs to train on?

Although we have few details, reward signals for reinforcement learning were likely achieved through verification (in formal domains such as mathematics and coding) and human labeling (in informal domains such as task breakdown and planning).

At inference time, OpenAI says they use RL to allow o1 to hone its CoT and refine the strategies it uses. We can assume the reward signal here is some kind of actor + critic system, similar to ones OpenAI has published before, and that they apply search or backtracking over the generated reasoning tokens during inference.
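We can only guess at the mechanics, but a toy version of critic-guided selection over sampled reasoning chains might look like the sketch below; `sample_chain` and `score_chain` are hypothetical placeholders for the actor and the critic, not o1's actual components.

```python
# Toy sketch of critic-guided selection over sampled chains of thought.
# `sample_chain` and `score_chain` are hypothetical placeholders, not o1's actual components.
from typing import Callable

def best_of_n_cot(
    task: str,
    sample_chain: Callable[[str], str],        # "actor": samples one reasoning chain
    score_chain: Callable[[str, str], float],  # "critic": scores a chain for this task
    n: int = 8,
) -> str:
    chains = [sample_chain(task) for _ in range(n)]
    # More samples means more test-time compute; the critic picks the most promising chain.
    return max(chains, key=lambda chain: score_chain(task, chain))
```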

Test-time compute

The most important aspect of o1 is that it demonstrates a working example of CoT reasoning search applied to informal language, rather than to formal languages such as mathematics, code, or Lean.

While the added training-time scaling via CoT is noteworthy, the big novelty is test-time scaling.

We believe that iterative CoT genuinely unlocks greater generalization. Automatic iterative re-prompting allows the model to better adapt to novelty, similar to the test-time fine-tuning used by the MindsAI team.

If we only perform a single inference, we are limited to reusing learned programs. But by generating intermediate CoTs, or programs, for each task, we open up the possibility of composing components of learned programs, achieving adaptation. This technique is one way to overcome the #1 problem of large language model generalization: the ability to adapt to novelty. Although, like test-time fine-tuning, it ultimately remains limited.
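For intuition, here is a minimal sketch of an automatic draft-critique-revise loop of the kind described above; the prompts, the stopping rule, and the `call_model` client are all our own assumptions, not o1's actual procedure.

```python
# Sketch of automatic iterative re-prompting: draft, critique, revise, repeat.
# `call_model` is a hypothetical LLM client; the prompts and stopping rule are illustrative only.

def iterative_cot(task: str, call_model, max_rounds: int = 3) -> str:
    draft = call_model(f"{task}\nThink step by step and propose a solution.")
    for _ in range(max_rounds):
        critique = call_model(
            f"Task:\n{task}\n\nProposed solution:\n{draft}\n\n"
            "List any mistakes or unhandled cases. Reply 'OK' if there are none."
        )
        if critique.strip() == "OK":
            break  # the critique found nothing to fix
        draft = call_model(
            f"Task:\n{task}\n\nPrevious attempt:\n{draft}\n\n"
            f"Issues found:\n{critique}\n\nProduce a corrected solution, step by step."
        )
    return draft
```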

When AI systems are allowed a variable amount of test-time compute (e.g. the number of reasoning tokens or search time), there is no objective way to report a single benchmark score, because the score is relative to the compute allowed. This is what the chart shows.

More compute means more accuracy.

When OpenAI released o1, they could have let developers specify the amount of compute or time allotted to refine the CoT at test time. Instead, they "hard-coded" a point along the test-time compute continuum and hid that implementation detail from developers.

With variable test-time compute, we can no longer simply compare the outputs of two different AI systems to assess relative intelligence. We also need to compare compute efficiency.

While OpenAI did not share efficiency figures in its announcement, it is exciting that we are entering a period where efficiency will be a focus. Efficiency is central to the definition of AGI, which is why the ARC Prize enforces an efficiency limit on winning solutions.
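As a toy illustration of what reporting efficiency alongside accuracy could look like, here is a sketch; the specific metric (tasks solved per dollar) is our own choice, not an official ARC Prize definition.

```python
# Sketch: report accuracy together with a simple efficiency ratio.
# The metric (tasks solved per dollar) is an assumption, not an official ARC Prize definition.
from dataclasses import dataclass

@dataclass
class EvalResult:
    solved: int
    total: int
    total_cost_usd: float  # could equally be reasoning tokens or wall-clock seconds

    @property
    def accuracy(self) -> float:
        return self.solved / self.total

    @property
    def solved_per_dollar(self) -> float:
        return self.solved / self.total_cost_usd

# Illustrative numbers only: equal accuracy, very different efficiency.
a = EvalResult(solved=100, total=400, total_cost_usd=50.0)
b = EvalResult(solved=100, total=400, total_cost_usd=500.0)
print(a.accuracy, a.solved_per_dollar)  # 0.25 2.0
print(b.accuracy, b.solved_per_dollar)  # 0.25 0.2
```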

Our forecast: going forward, we will see many more plots comparing accuracy against test-time compute.

ARC-AGI-Pub model baselines

OpenAI's o1-preview and o1-mini both outperform GPT-4o on the public ARC-AGI evaluation set. o1-preview is roughly on par with Anthropic's Claude 3.5 Sonnet in accuracy, but takes about 10 times longer to achieve results similar to Sonnet's.

To obtain the baseline model scores on the ARC-AGI-Pub leaderboard, we use the same baseline prompt that we used to test GPT-4o. When we test and report results on base models like o1, we aim to measure base-model performance as directly as possible, without imposing any optimization on top.

In the future, others may find more efficient ways to create CoT-style models, and we'll be happy to add them to the leaderboard if they're tested.

The increase in o1 performance came at a cost in time. It took 70 hours to run the 400 public tasks, compared to 30 minutes for GPT-4o and Claude 3.5 Sonnet.

You can use our open-source Kaggle notebook as a baseline test harness or as a starting point for your own approach. SOTA submissions to the public leaderboard come from clever methods on top of strong base models.
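If you would rather roll your own harness, a stripped-down evaluation loop over the public tasks might look like the sketch below; the grid layout follows the public ARC JSON format, while the directory path and `solve_task` are placeholders for your own setup.

```python
# Stripped-down evaluation loop over the ARC public evaluation tasks.
# `solve_task` is your own solver; the data directory is a placeholder path.
import json
from pathlib import Path

def solve_task(task: dict) -> list[list[int]]:
    """Return a predicted output grid for the first test input (placeholder)."""
    raise NotImplementedError

def evaluate(tasks_dir: str) -> float:
    solved, total = 0, 0
    for path in sorted(Path(tasks_dir).glob("*.json")):
        task = json.loads(path.read_text())
        expected = task["test"][0]["output"]  # the public eval set includes solutions
        if solve_task(task) == expected:
            solved += 1
        total += 1
    return solved / total

# print(evaluate("data/evaluation"))
```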

Perhaps you can figure out how to use o1 as a foundational component to achieve a higher result in a similar way!

Is there AGI here?

In this chart, OpenAI shows a log-linear relationship between accuracy and test-time compute on AIME. In other words, as compute grows exponentially, accuracy grows only linearly.

Many are asking a new question: how far can this scale?

The only conceptual limit of this approach is the solvability of the problem posed to the AI. As long as the search process has an external verifier that contains the answer, accuracy will keep increasing logarithmically with compute.
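Put differently, the reported curves roughly follow the relationship below, where c is the test-time compute budget; the coefficients are whatever the fit produces, not published constants.

```latex
% Rough shape of the reported scaling curves: accuracy grows with the log of compute.
\mathrm{accuracy}(c) \;\approx\; \alpha + \beta \log c
```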

In fact, the results presented are extremely similar to one of the best ARC Prize approaches, by Ryan Greenblatt. He scored 43% by using GPT-4o to generate k=2048 candidate solution programs for each task and deterministically verifying them against the demonstration examples.

He then assessed how accuracy changed for different values of k.
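Schematically, that sample-and-verify loop looks something like this (this is not Ryan's actual code; the program generator and executor are abstracted away):

```python
# Schematic sample-and-verify: generate k candidate programs, keep one that
# reproduces every demonstration pair, and apply it to the test input.
# `generate_program` and `run_program` are hypothetical placeholders.

def sample_and_verify(task: dict, generate_program, run_program, k: int = 2048):
    demos = task["train"]                    # list of {"input": grid, "output": grid}
    test_input = task["test"][0]["input"]
    for _ in range(k):
        program = generate_program(task)     # e.g. a Python snippet sampled from an LLM
        if all(run_program(program, d["input"]) == d["output"] for d in demos):
            return run_program(program, test_input)  # first verified program wins
    return None  # no candidate passed all demonstrations within the budget
```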

Ryan found an identical log-linear relationship between accuracy and test-time compute on ARC-AGI.

Does all this mean that AGI is already here if we just scale the computations during testing? Not quite.

You can see similar exponential scaling curves by looking at any brute force search, which is O(x^n). In fact, we know that at least 50% of ARC-AGI can be solved with brute force and zero AI.

To beat ARC-AGI in this way, you would need to generate over 100 million solvers for each problem. Practicalities alone rule out O(x^n) search for scalable AI systems.

Moreover, we know that this is not how humans perform ARC tasks. Humans do not generate thousands of potential solutions; instead, we use the perceptual network in our brains to “see” a few potential solutions and deterministically test them using System 2-style thinking.

We can do better than that.

New ideas are needed

Intelligence can be measured by how well a system converts information into action across a space of situations. It is a conversion ratio, and as such it approaches a limit. Once you have perfect intelligence, the only way to progress is to gather new information.

There are several ways in which a less intelligent system can appear more intelligent without actually being more intelligent.

One way is for a system to simply remember the best action. Such a system would be very fragile, seeming intelligent in one area but easily failing in another.

Another way is trial and error. A system may seem intelligent if it eventually gets the answer right, but not if it takes 100 tries to get it right.

Future research into test-time compute is expected to explore how to scale search and refinement more efficiently, possibly using deep learning to guide the search process.
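One plausible direction, sketched below, is to let a learned value model prune a beam of partial reasoning chains rather than sampling blindly; `extend_chain` and `value_model` are hypothetical components, not a description of any published system.

```python
# Sketch of value-model-guided beam search over partial reasoning chains.
# `extend_chain` proposes continuations; `value_model` scores how promising a partial chain is.
# Both are hypothetical components, not a description of any published system.

def guided_search(task: str, extend_chain, value_model, beam_width: int = 4, depth: int = 6):
    beam = [""]  # start from an empty reasoning chain
    for _ in range(depth):
        candidates = [c for chain in beam for c in extend_chain(task, chain)]
        # The learned model guides the search: keep only the highest-valued partial chains.
        beam = sorted(candidates, key=lambda c: value_model(task, c), reverse=True)[:beam_width]
    return beam[0] if beam else None
```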

However, we do not believe this alone explains the large gap between o1's results on ARC-AGI and its results on other objectively difficult benchmarks such as IOI or AIME.

A more convincing explanation is that o1 still operates mostly within the distribution of its training data, which now also includes all the newly created synthetic CoTs.

The additional synthetic CoT data refocuses training on the distribution of CoTs rather than just the distribution of answers (more compute is spent on how to reach the answer, rather than on what the answer is). We expect systems like o1 to perform better on benchmarks that involve reusing known, emulated reasoning patterns (programs), but to still struggle with problems that require synthesizing entirely new reasoning on the fly.

Test-time refinement of the CoT can only go so far in correcting reasoning errors. This also helps explain why o1 is so impressive in certain domains: CoT test-time refinement gets an extra boost when the base model has been pre-trained on similar CoTs.

Neither approach alone will give you a big jump.

To summarize, o1 represents a paradigm shift from “memorizing answers” to “memorizing reasoning”, but is not a departure from the broader paradigm of fitting a curve to a distribution with the goal of improving performance by fitting everything inside the distribution.

Do you have any ideas on how to take these new ideas further? What about CoT with multimodality, CoT with code generation, or combining program search with CoT?

