Large language models – a race to a dead end or a breakthrough into the future?

I'm returning to my favorite topic of large language models (LLMs). Watching the industry, its events, and its dynamics over recent months, I see a movement accelerating toward a dead end, and the finish could be spectacular. Where do these conclusions come from? Let's take it in order.

Those who actively use LLMs in their work, especially when that work goes beyond writing texts into more serious analytical tasks or coding, have probably noticed that the models clearly lack the ability to abstract and to be systematic. They constantly get hung up on details; debugging code is a good illustration. They cope well with minor errors, but if the error is systemic, in the logic of the code or in the data structure, they usually fail. The same goes for analytics: they handle junior-level tasks well, but more serious levels cause difficulties. Let's note this fact and move on.

The biggest drawback of an LLM's neural network, in my humble opinion, is that its structure is static. The human brain has a dynamic structure, while an LLM's structure is fixed from the very beginning: the number of layers, their width, and the number of input and output parameters are baked in, and nothing can be changed afterwards, only trained. During training, conditional "images" and concepts form inside the network. Some of them can be matched to words of a language we know (which some fans of LLM anatomy do quite successfully), while others probably have no analogues, since they represent more complex abstractions. Let's note two key parameters of a neural network: width and depth.

Depth is the number of layers in the neural network. It determines how far the model's ability to abstract can go. At the input we have lower-order abstractions, tokens (parts of words, characters), while deep inside the model we already have vector representations of complex concepts. Insufficient depth leads to exactly the problem described at the very beginning: superficiality and an inability to do deep, systemic analysis.

Width is the number of neurons in a given layer. It determines how many representations the network can operate with at that level. The more there are, the more fully they can reflect the ideas of the real world, of which an LLM is, in essence, a reflection. What happens if the width of some layer is insufficient? It cannot fully form the conceptual apparatus for that level of abstraction; the result is errors and the substitution of concepts with similar ones, which means loss of accuracy or hallucination. What if the width is excessive? The conceptual apparatus becomes harder to form, blurrier, and accuracy again suffers. In practice, as far as I can see, the first case is far more common.
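To make the "static structure" point concrete, here is a minimal sketch in plain Python (no real framework; the class name, sizes, and activation are my illustrative assumptions) of how such a stack is typically defined: depth and width are chosen once, before training, and training can only change the values of the weights, never the shape of the network.

```python
import numpy as np

class StaticStack:
    """Toy feed-forward stack: depth and width are frozen at construction time."""

    def __init__(self, depth: int, width: int, d_in: int, d_out: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        sizes = [d_in] + [width] * depth + [d_out]
        # Weight matrices are allocated once; training only changes their values,
        # never the number of layers or the number of neurons per layer.
        self.weights = [rng.normal(0.0, 0.02, (m, n))
                        for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(self, x: np.ndarray) -> np.ndarray:
        for w in self.weights[:-1]:
            x = np.maximum(x @ w, 0.0)    # hidden layers with ReLU
        return x @ self.weights[-1]       # linear output layer

model = StaticStack(depth=12, width=1024, d_in=512, d_out=512)
print(sum(w.size for w in model.weights), "parameters, fixed for the model's lifetime")
```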

The key problem is that we do not know in advance what the width of each specific layer and the depth of the whole model should be. In a living brain these parameters are dynamic, because they depend on the information received during learning: neurons are formed and die, connections change. The architecture of the artificial neural networks we actually use, however, is static, so the only option is to set the width and depth larger, with a margin. Even then, no one can guarantee that the margin will be enough for any particular layer. And this gives rise to a number of problems.

1. While increasing a model's depth grows its parameter count linearly, increasing a layer's width grows it according to a power law, roughly quadratically (a rough parameter-count sketch follows after this list). That is why we see top LLMs exceeding a trillion parameters, while comparing them with models two orders of magnitude smaller does not show a correspondingly dramatic difference in generation quality. And since further model growth is extensive rather than qualitative, we can watch industry leaders frantically ramping up computing power, building new data centers, and scrambling to supply these monsters with electricity. Meanwhile, improving model quality by a conditional 2% requires an order-of-magnitude increase in computing power.

2. The unbridled growth in parameter count requires an enormous amount of training data, and preferably high-quality data, which is a big problem. Already the question of artificially generating new training data is being raised, because the natural supply is running out. Stuffing the model with everything that comes to hand creates its own problems: degraded generation quality, bias, and so on.

3. During training, every iteration over every supplied token involves essentially all of the model's weights in the gradient computation and update. This is catastrophically inefficient. Imagine that, while reading a book, you had to reread it from the beginning before each next word! (Yes, the comparison is imprecise, but it conveys the scale of the problem.) Moreover, at inference time almost all of the model's weights participate in producing every single output token; see the per-token cost estimate after this list.

4. As computational complexity increases, the problem of parallelism arises. Computing overhead does not grow linearly with model size, and communication between cluster nodes adds its own latency. New accelerators with more memory and various optimizations partly solve the problem, but only partly, because the models themselves are growing at a much faster pace.
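A back-of-the-envelope sketch of points 1 and 3, as promised above. It assumes a generic decoder-only transformer block layout and the common rule of thumb of roughly 2 FLOPs per parameter per generated token; the function names, shapes, and numbers are illustrative, not measurements of any specific model.

```python
def transformer_params(depth: int, width: int, ffn_mult: int = 4) -> int:
    """Rough parameter count of a decoder stack (embeddings and biases ignored).

    Per block: ~4*width^2 for the attention projections (Q, K, V, output)
    plus ~2*ffn_mult*width^2 for the feed-forward layers.
    """
    per_block = 4 * width**2 + 2 * ffn_mult * width**2
    return depth * per_block


def flops_per_token(params: int) -> float:
    """Rule of thumb: about 2 FLOPs per parameter for each generated token."""
    return 2.0 * params


base   = transformer_params(depth=32, width=4096)
deeper = transformer_params(depth=64, width=4096)   # 2x depth -> ~2x parameters
wider  = transformer_params(depth=32, width=8192)   # 2x width -> ~4x parameters

print(f"base:   {base/1e9:.1f}B params, ~{flops_per_token(base)/1e9:.0f} GFLOPs per token")
print(f"deeper: {deeper/1e9:.1f}B params (x{deeper/base:.1f})")
print(f"wider:  {wider/1e9:.1f}B params (x{wider/base:.1f})")
```

Doubling depth roughly doubles the parameter count, while doubling width roughly quadruples it (point 1), and since almost every one of those parameters is touched for each generated token, per-token cost grows right along with model size (point 3).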

These are only some of the problems, but they are the most acute ones, and they are quite obvious to those who develop LLMs. So why, with such persistence, passion, and ever-increasing acceleration, are they rushing toward a technological dead end? The answer is simple. LLM technology has undoubtedly shown what it can do, and at the current technological level it is quite capable of producing a system close to, or maybe even superior to, a human. Whoever does this first will, in a sense, invent a new atomic bomb, an absolute weapon that will give a new technological impetus, perhaps help develop a new, more efficient architecture and, as the dead end draws near, make the quantum leap that overcomes this potential barrier. Maybe. Or maybe it won't work out. And although the market leaders are full of optimism, we may witness another financial disaster, a new dot-com crash squared. It will happen when the next conditional GPT5 fails to live up to the high hopes, and the resources needed to create GPT6 are measured not in billions but in hundreds of billions or trillions of dollars. Sam Altman recently surprised everyone by voicing exactly such astronomical estimates of the resources he wants to attract. And he knows what he is talking about.

But let's come back down to earth. We are in Russia, under technological sanctions. Industry leaders Sber and Yandex are trying to build something with their models, but we can see that… well, let's not dwell on sad things. Is there a way out? There is always a way out, sometimes more than one. Some work is surely under way (probably quite a lot of it), but fundamental things, such as new architectures for neural networks in particular and for artificial intelligence systems in general, are not created quickly. And the market leaders are locked in a race measured in months; they have no time for new architectures and are squeezing the maximum out of what they have. We will definitely not catch up with them on that road, so we need to take a different one. Let's leave aside exotic technologies like quantum computers; that is a matter of the distant future.

Sometimes, to invent something new, you only need to recall the well-forgotten old. For a long time, AI development followed the path of deterministic models: expert systems, fuzzy-logic systems, and so on. Among them, the technology of semantic networks stands out, where nodes are concepts and edges are the relations between them (to a certain approximation, modern LLMs are semantic networks too, only non-deterministic ones). On top of this we add a hierarchical superstructure for abstracting concepts. The structure itself can be made dynamic, so that nodes and edges are created during learning. Training and inference would be implemented with agent technologies: agents, following given rules and their internal state, move through the network graph and either make point changes (training) or collect information to form the answer to a query. The agent-based approach does not require recalculating the entire network and parallelizes well, without demanding colossal computing power. A rough sketch of the idea is given below.
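A minimal sketch of this idea, under heavy simplification: a toy concept graph plus one very simple agent rule. All class names, fields, and the traversal rule here are my illustrative assumptions, not a specification of the proposed architecture.

```python
import random
from collections import defaultdict

class SemanticNet:
    """Dynamic semantic network: concept nodes, weighted relation edges."""

    def __init__(self):
        self.edges = defaultdict(dict)   # node -> {neighbor: relation strength}

    def reinforce(self, a: str, b: str, delta: float = 0.1):
        """Point-wise training step: create the relation if needed, then strengthen it."""
        self.edges[a][b] = self.edges[a].get(b, 0.0) + delta
        self.edges[b][a] = self.edges[b].get(a, 0.0) + delta

class Agent:
    """Walks the graph from a query concept, preferring stronger relations."""

    def __init__(self, net: SemanticNet, seed: int = 0):
        self.net = net
        self.rng = random.Random(seed)

    def answer(self, start: str, steps: int = 5) -> list[str]:
        path, node = [start], start
        for _ in range(steps):
            neighbors = self.net.edges.get(node, {})
            if not neighbors:
                break
            # Weighted choice: stronger relations are followed more often.
            node = self.rng.choices(list(neighbors), weights=list(neighbors.values()))[0]
            path.append(node)
        return path

net = SemanticNet()
for pair in [("dog", "animal"), ("animal", "living thing"), ("dog", "barks")]:
    net.reinforce(*pair)            # local updates only, no global recomputation
print(Agent(net).answer("dog"))     # e.g. ['dog', 'animal', 'living thing', ...]
```

Each reinforce call touches only the two nodes involved, and independent agents can walk different parts of the graph at the same time, which is what makes the approach cheap to parallelize compared with recomputing a dense model end to end.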

That's all, thanks to everyone who read to the end) As always, I'll be glad to hear substantive comments, remarks, and ideas!
