Should I wait for ChatGPT-o1 at home?

It started with the experience of prompt engineers, who found out that if a request tells the model how to think about the problem and asks it to write out all of its steps explicitly, the model solves the problem much better. Literally allowing the model to use its own output as a scratchpad and as memory for intermediate steps proved effective. Later, interpretations appeared suggesting that people themselves rely on an internal monologue and rough drafts, but these intermediate steps do not always end up in the training data.
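To make the trick concrete, here is a minimal sketch of a plain prompt versus a chain-of-thought prompt; the question and the exact wording are illustrative assumptions rather than quotes from any particular paper.

```python
# A plain request: the model is expected to jump straight to the answer.
direct_prompt = (
    "Q: A train covers 120 km in 1.5 hours. What is its average speed?\n"
    "A:"
)

# A chain-of-thought request: the model is explicitly invited to use its own
# output as a scratchpad for intermediate steps before the final answer.
cot_prompt = (
    "Q: A train covers 120 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step, writing out every intermediate result, "
    "and only then give the final answer."
)
```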

After this, triggering a chain of reasoning through prompting gained recognition as a method and began to acquire modifications:

  • Methods for generating reasoning not only sequentially but also in parallel, with subsequent selection of the most promising chain (a rough sketch of this approach follows the list).

  • Methods of self-criticism, where the model itself assesses the quality of its answer. Surprisingly, checking reasoning turns out to be a much more natural task for an LLM than generating it.
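As an illustration of the parallel-generation idea, here is a sketch of majority voting over several independently sampled chains (the core of what is usually called self-consistency). The `generate` callable and the `Answer:` line convention are assumptions made for this example, not part of any particular method's API.

```python
from collections import Counter

def sample_answers(generate, prompt, n=8, temperature=0.8):
    """Sample several independent reasoning chains and keep only their final answers.

    `generate` is assumed to be any callable that returns the model's full text
    completion for a prompt (an API call, a local llama.cpp binding, etc.).
    """
    answers = []
    for _ in range(n):
        completion = generate(prompt + "\nLet's think step by step.",
                              temperature=temperature)
        # Assume each chain ends with a line like "Answer: <value>".
        for line in reversed(completion.splitlines()):
            if line.lower().startswith("answer:"):
                answers.append(line.split(":", 1)[1].strip())
                break
    return answers

def self_consistent_answer(generate, prompt, n=8):
    """Return the most frequent final answer across the sampled chains."""
    answers = sample_answers(generate, prompt, n=n)
    return Counter(answers).most_common(1)[0][0] if answers else None
```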

As researchers and enthusiasts later discovered for themselves, prompting is not strictly necessary: it is enough to look carefully at the output probabilities of the generated tokens. These probabilities are enough to detect that the model is unsure of its answer and to regenerate the reasoning in the hope that the next attempt comes out better.
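A hedged sketch of what such confidence-based resampling could look like with the Hugging Face transformers library: generate an answer, measure the average log-probability of its tokens, and retry if it falls below a threshold. The model name and the threshold value are placeholders chosen for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_with_confidence(prompt, max_new_tokens=128):
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # Log-probabilities of the tokens the model actually chose.
    scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    text = tok.decode(out.sequences[0, inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
    return text, scores.mean().item()

def answer(prompt, threshold=-1.0, attempts=4):
    """Resample until the mean token log-probability clears the threshold."""
    best_text, best_conf = None, float("-inf")
    for _ in range(attempts):
        text, conf = generate_with_confidence(prompt)
        if conf > best_conf:
            best_text, best_conf = text, conf
        if conf > threshold:  # confident enough, stop early
            break
    return best_text
```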

Naturally, everything a prompt can do to elicit correct reasoning can also be hardwired into the model itself during tuning:

  • Using a prompt, direct the model to reason step by step, quickly check the reasoning manually, and use the resulting datasets for further training.

  • In the same way, collect good and bad examples of reasoning and use them to tune the model on preferences (Direct Preference Optimization), or to train a critic model and tune the resulting model with its help (a minimal sketch of the DPO objective follows this list).

  • If we are talking about tuning through a critic model, then large companies like OpenAI can afford to add human moderation, evaluating each individual step in the reasoning process.
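For reference, here is a minimal sketch of the DPO objective on pairs of good and bad reasoning chains, written in plain PyTorch; ready-made trainers (for example, trl's DPOTrainer) wrap the same formula with batching, masking, and reference-model handling.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of the chosen or
    rejected reasoning chain under the policy being tuned or under the frozen
    reference model. A minimal sketch, not a full training loop.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the chosen chain more strongly than the reference does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```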

From here, the publicly available articles from large laboratories scatter in different directions, but they operate with the same components: generating chains of reasoning, building models that assess the quality of a chain by the correctness of its answer, building models that assess the quality of individual steps, and using these models both for selecting training data and at inference time. Examples: 1, 2, 3, 4.

Now back to the question: does this give hope to local LLM enthusiasts? Most likely, yes. Although only OpenAI can boast such a model for now, we can expect similar tuning methods to be applied to models with open weights in the future. Moreover, the precedent of the o1 release largely changes the industry's view of small models and their purpose. In an interview, Andrej Karpathy says that we should perhaps expect the "encyclopedic" functionality of language models to become less important, since for the tasks these models solve, most of the knowledge about the Internet hardwired into them is clearly redundant. We can therefore expect models of 1-8 billion parameters capable of sound human-level reasoning that turn to external sources for additional information.

Shrinking large models

It is difficult to talk about the number of parameters in the ChatGPT-o1 models without speculating, but the question is still worth asking: if the weights of a model of this caliber suddenly became available to the community, what methods could make it fit into a consumer GPU? There are not many answers: quantization and distillation.

Model quantization reduces the size of models that originally store parameters in FP32 or FP16 precision by lowering the precision of individual parameters. The technique is not new; quantized GGUF files are already a standard format for distributing models for individual use. Along with precision, the models also lose quality, but the community is willing to experiment with the size of the initial model and the degree of its compression for specific tasks. So if a frontier model appears in the public domain, its less precise analogues will appear immediately. The main thing is that the initial model be of a reasonable size: so far, models like Llama 3 405B remain out of reach for ordinary enthusiasts, not only on GPU but even for CPU inference on most consumer machines.
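As an illustration, here is one common way to squeeze a model onto a consumer GPU: loading it in 4-bit precision through transformers and bitsandbytes. The model name is a placeholder; running GGUF files through llama.cpp is another equally common route.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: any causal LM on the Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPU/CPU memory
)
```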

One-bit models can be called the limiting case of reduced-precision models. Early last year Microsoft introduced the concept of one-bit LLMs, demonstrating that the value range (-1, 0, 1) is enough for such models to function. Unfortunately, such a model cannot be obtained through quantization; it must be trained for the ternary representation from the start. Moreover, such a model reaches the quality of classical LLMs only with roughly a tenfold increase in the number of parameters, that is, using about the same amount of memory. The main winners are again those with specialized hardware, so expectations for the recently released libraries for 1-bit LLMs should be kept extremely modest.
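For intuition, here is a sketch of the "absmean" ternary quantization used in the 1.58-bit formulation: scale the weights by their mean absolute value, then round and clip to {-1, 0, 1}. As noted above, applying this to an already trained model will not by itself produce a working 1-bit LLM.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a weight matrix to the ternary set {-1, 0, 1}.

    Sketch only: in practice the constraint is baked into training, and the
    scale is kept around to rescale the layer's outputs.
    """
    scale = w.abs().mean() + eps
    w_ternary = (w / scale).round().clamp_(-1, 1)
    return w_ternary, scale
```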

Another way to democratize large models is distillation, a method in which a small model is trained not on the original data but on the output predictions (logits) of a more powerful teacher model. Such data is a much richer source of information than the text the original model was trained on. Examples of distilled models are the 1B and 3B versions of Llama 3.2, which are easily accessible for use.
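A minimal sketch of the classic logit-distillation objective: the student is trained to match the teacher's softened output distribution. Real recipes typically mix this with the ordinary next-token cross-entropy; the temperature value here is an illustrative choice.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.

    Both tensors are assumed to have shape (batch, seq_len, vocab).
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean reduction plus the T^2 factor from the original distillation recipe
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```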

Distillation and quantization offer hope that the closed training methods described in the previous section have a chance of reaching enthusiasts in the form of open weights and of being usable on consumer hardware.

Alternative architectures

All of the methods listed so far mainly concern the nuances of model training, the data the model is trained on, and inference optimization, but they do not change the classical GPT architecture itself. It is therefore interesting to consider which architectural changes could adapt models to the constraints of local inference.

One remark has to be made right away: many alternative architectures are currently at a standstill. Most ideas that show results on small models fail to show them at industrial data volumes, which makes transferring academic research to industry very difficult. An example is the paper on the Mamba architecture, which did not make it into a conference because the authors "did not check" the scaling trajectories for models with more than one and a half billion parameters. As a result, most ideas naturally hang in limbo, unable to be either confirmed or refuted.

Among the architectures that have made their way into the mainstream, we can single out MoE (Mixture of Experts) models: a construction of architecturally identical LLMs called experts, plus a router model that, depending on the context, selects which experts should predict the next token. The idea that the model can use only part of its weights while processing a token not only speeds up computation but also helps sidestep the limit on model size, by juggling which experts' weights sit in GPU memory at any given moment. And although raw compute stops being the bottleneck, such models are still limited by memory access speed. As a piece of unrelated good news, methods are already being investigated for improving small models by turning them into MoE models.
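A minimal sketch of an MoE feed-forward layer with a top-k router, to make the "use only part of the weights per token" idea concrete. The sizes are illustrative, and the dispatch is deliberately naive; production implementations route tokens to experts in groups and add load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """A mixture-of-experts feed-forward layer with a top-k router."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (batch, seq, d_model)
        logits = self.router(x)                  # (batch, seq, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights of the k chosen experts
        out = torch.zeros_like(x)
        # Naive dispatch: every expert is applied and then masked. Real
        # implementations send each token only to its k chosen experts,
        # which is where the compute savings come from.
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)
            for slot in range(self.k):
                mask = (idx[..., slot] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., slot:slot + 1] * expert_out
        return out
```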

It seems that to address the central problem of GPU utilization efficiency, we need to use fewer parameters but use them several times. A similar idea already appeared in Tree of Thought methods, where several independent chains of reasoning could be generated in parallel from the same prompt. But it would be great to apply the idea more generally. Reusing the same weights has already been done in encoder networks such as ALBERT and in Universal Transformers, but whether it works for GPT-style models remains an open question; I have not found any breakthrough results.
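A sketch of this weight-sharing idea: instead of a stack of distinct layers, one shared transformer layer is applied several times, in the spirit of ALBERT and Universal Transformers. All hyperparameters are illustrative, and positional encodings are omitted for brevity.

```python
import torch.nn as nn

class RecurrentBlockLM(nn.Module):
    """One shared transformer layer applied n_steps times instead of n distinct layers."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_steps=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_steps = n_steps
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                # token_ids: (batch, seq)
        h = self.embed(token_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        for _ in range(self.n_steps):            # the same weights are reused each step
            h = self.shared_layer(h, src_mask=causal)
        return self.lm_head(h)                   # next-token logits
```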

Among the ideas in this direction, I would like to highlight the introduction of "empty" tokens, whose main purpose is not to carry new information or generate text, but to give the model extra room for storing intermediate data and performing additional computation. The rather simple approach proposed in the original article, which scatters these tokens randomly throughout the text, already shows good results, but it does not yet answer the question of how to use the idea in a more general form.
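A sketch of the mechanical part of this idea: scattering a special pause token across the input so the model gets extra positions to compute in. The pause token is assumed to have been added to the vocabulary and learned during training; inserting it into an off-the-shelf model will not help by itself.

```python
import random

def insert_pause_tokens(token_ids, pause_id, n_pauses=10, seed=0):
    """Scatter "empty" pause tokens at random positions in the input.

    `pause_id` is assumed to be a special token id added to the tokenizer.
    At decoding time the outputs at these positions are simply skipped.
    """
    rng = random.Random(seed)
    ids = list(token_ids)
    for _ in range(n_pauses):
        ids.insert(rng.randrange(len(ids) + 1), pause_id)
    return ids
```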

Conclusion

As you can see, attempts to bring large language models to local user devices today run into serious technical limitations, but the success of quantization and distillation gives reason to believe that the progress of industrial models will also become available for local use. Research by the academic community and by enthusiasts also gives hope that the singularity will not happen behind closed doors.
