A brief introduction to KAN for the most unprepared readers

In April, researchers and mathematicians announced a new neural network architecture. The announcement did not cause much of a stir, although in our view KAN is a genuinely interesting technology. More importantly, this is not just another variation of a transformer or a patched-up recurrent network – it is a new approach to neural networks in principle, a new architecture in place of the MLP.

We wrote a large article on KAN with all the details; here, for the most unprepared readers, we will very briefly go over the main principles of the architecture and its problems.

Still, we assume some familiarity with basic linear algebra and calculus.

An MLP is an ordinary fully connected neural network in which, thanks to the layer-by-layer activation of neurons, we obtain a final result on the last layer. It loosely imitates the way neurons in the brain work: by passing impulses from neuron to neuron, we get a result in the form of associations, memories…

So the essence of KAN comes down to shifting the emphasis from “activating neurons” to activating “connections” between them.

Of course, the connections themselves don’t go away, but… Now, instead of the usual scalar weights between neurons, we get trainable activation functions – we connect neurons with B-splines. Weights are the numerical coefficients that determine how strongly a neuron is activated: a larger weight means a stronger signal.
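
To make the contrast concrete, here is a purely conceptual NumPy sketch of our own (not PyKAN code): an MLP connection carries one trainable number, while a KAN connection carries a whole trainable curve. The Gaussian bumps below merely stand in for the B-spline basis used in real KANs.

```python
import numpy as np

# MLP edge: the connection carries a single trainable number (a weight);
# the nonlinearity (ReLU, sigmoid, ...) is fixed and lives on the neuron.
def mlp_edge(x, w):
    return w * x                                   # signal scaled by a learned constant

# KAN edge (conceptual): the connection carries a trainable 1-D function,
# sketched here as a sum of local basis functions with learnable
# coefficients; real KANs use a B-spline basis instead of Gaussians.
def kan_edge(x, coeffs, centers, width=1.0):
    basis = np.exp(-((x - centers) / width) ** 2)  # local "bumps"
    return np.dot(coeffs, basis)                   # learned curve evaluated at x

centers = np.linspace(-2, 2, 5)
coeffs = np.array([0.1, -0.3, 0.8, -0.3, 0.1])     # these are what training adjusts
print(mlp_edge(1.5, w=0.7), kan_edge(1.5, coeffs, centers))
```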

Neural networks with a large number of layers turn into a black box. We cannot tell which numerical values are responsible for which characteristics, or how those characteristics are transformed into the values. The inner workings of the network therefore remain inaccessible to us.

Let us recall what the classic MLP formula looks like:
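
In its usual form, each layer multiplies its input by a matrix of trainable weights W, adds a bias b, and passes the result through a fixed nonlinearity σ:

$$
\mathbf{x}^{(l+1)} = \sigma\!\left(W^{(l)}\mathbf{x}^{(l)} + \mathbf{b}^{(l)}\right),
\qquad
\mathrm{MLP}(\mathbf{x}) = \left(W^{(L)} \circ \sigma \circ \cdots \circ \sigma \circ W^{(1)}\right)(\mathbf{x}).
$$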

The activation function, applied to the weighted sum of inputs, can be thought of as a filter that tells the neuron whether to fire or not…

The MLP is the basic level of neural network design; other models, from generative to recurrent, are built on a similar architecture. KAN, by contrast, turns this approach around a bit.

The Kolmogorov-Arnold Network (KAN) uses B-splines for function approximation because of their computational efficiency, local support that reduces processing complexity, and ability to provide smooth and flexible function representation.

Approximation, or function recovery, here essentially means recovering the patterns hidden in the data.

The key word is locality. It is locality that lets us reshape functions flexibly, piece by piece – this is what makes the “learning” of curves possible. B-splines model nonlinear relationships accurately, are easy to implement, and offer fine control over the fitting parameters…
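
A small, self-contained SciPy illustration of that locality (a demonstration of B-spline behaviour, not KAN code): changing a single spline coefficient bends the curve only near the corresponding knots and leaves the rest untouched.

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                               # cubic B-spline
t = np.array([0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 5], dtype=float)  # knot vector
c = np.zeros(8)                                     # 8 coefficients -> the zero curve
flat = BSpline(t, c, k)

c2 = c.copy()
c2[2] = 1.0                                         # nudge one coefficient
bumped = BSpline(t, c2, k)

x = np.linspace(0, 5, 11)
# The difference is non-zero only on a local window (roughly 0 < x < 2):
print(np.round(bumped(x) - flat(x), 3))
```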

Flexibility allows the neural network to train activation functions in such a way as to obtain the best result. Activation functions in neural networks help decide whether a neuron is activated (whether it will transmit a signal further) or not.

To begin with, it’s worth returning to the key task of a neural network – finding patterns and hidden dependencies. A dataset (an ordered set of data) is selected, which becomes the source of input data. At the output, thanks to activations and processing, we get the final result. The main computational problem is approximation of a multidimensional function.

But the “Curse of Dimensionality” is a problem that neural networks face when working with high-dimensional data, that is, with a large number of features or variables. Here we have written about it in more detail.

f(x, y, z, e, d, …) – we can put many attributes into the arguments of the function we are looking for. The number of arguments is the dimensionality of the function.

In simple terms, the more dimensions (or features) the data has, the more difficult it becomes to analyze it and build models that perform well.

Say we want to classify by thousands of characteristics – then we need on the order of N^d data points, where N is the number of data points per dimension and d is the number of dimensions.
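
A back-of-the-envelope illustration (the numbers are chosen purely for demonstration): with N = 10 sample points per dimension, the grid needed to cover the space explodes as d grows.

```python
N = 10                          # points per dimension
for d in (1, 3, 6, 10):         # number of features (dimensions)
    print(d, N ** d)            # 10, 1000, 1000000, 10000000000
```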

Models can work well on training data, but do not generalize well to new, unseen data, since in high dimensions the probability of overfitting increases significantly.

But what should we do with this function of many arguments so that we neither lose information nor pile up workarounds? Break the function into simple nonlinear functions…

A multivariate function – with its literally hundreds of combinations of argument values, if you decide to explore a function of many variables – can be translated into simple, human-friendly one-dimensional pieces…

We can turn “multidimensionality” into “unidimensionality.”

There is no longer a need to solve problems with tens and hundreds of data dimensions – now we work in one dimension.

This is what the Kolmogorov-Arnold theorem says.

Even more precisely: any continuous function of several variables can be represented through sums and compositions of nonlinear “one-dimensional” functions.

Probably everyone remembers from school those very functions where you need to find the “domain of definition”…

Such functions can be expanded. The general formula looks like this:
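
The classical statement (in the form also cited in the KAN paper) is:

$$
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \varphi_{q,p}(x_p)\right),
$$

where every outer function $\Phi_q$ and every inner function $\varphi_{q,p}$ depends on a single variable.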

We sum and compose these “one-dimensional” functions and recover the original function of many attributes/arguments. If you are more interested in the mathematical side of the question, follow the link.

Programmers can now work with symbolic formulas.

In this formula, B-splines play the role of functions that can be defined piecewise (over only those ranges of values we need) and adjusted flexibly.

It's better to show it visually:

Tangents – it is through them that changes to the splines are made.

We move a tangent and the behaviour of the graph of the nonlinear function, the B-spline, changes. Tangents, as we remember from school, are given by linear functions kx + c. On a graph they show how quickly the function’s values grow or fall.

Calculating that rate at a single point gives the derivative of the function. Naturally, in order to adjust splines sensibly, we need this “speed” of the graph to change evenly, without jumps. Functions whose derivatives change continuously are, roughly speaking, called smooth.
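
In slightly more formal terms (a standard calculus definition, not something specific to KAN):

$$
f \in C^{k} \;\Longleftrightarrow\; f',\, f'',\, \dots,\, f^{(k)} \ \text{exist and are continuous}.
$$

A cubic B-spline with simple knots is $C^{2}$: its value, slope and curvature all change without jumps, which is exactly the behaviour needed here.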

But here a key problem appears. Who said that, when expanding a function of many variables via the Kolmogorov-Arnold theorem, we will obtain only a sum of nonlinear and smooth functions?

One of the problems with KAN is that the expansion of a multidimensional function may contain fractal functions.

Fractal functions whose range of values extends into the complex numbers. Complex numbers are an extension of the ordinary real numbers. Fractal formulas are often unwieldy and intimidating; they are hard to compute even before complex numbers enter the picture.

On the other hand, this approach, according to the theorem, offers us the opportunity to rewrite the standard MLP and approach neural networks from a new angle – now we look for patterns by training B-spline activation functions.

Thanks to their locality they are flexible and easy to adjust, and it is these activation functions that are actively trained.
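
Schematically: where an MLP layer computes $\mathbf{x}^{(l+1)} = \sigma(W^{(l)}\mathbf{x}^{(l)} + \mathbf{b}^{(l)})$ with trainable weights and a fixed $\sigma$, a KAN layer (in the notation of the original paper) computes

$$
x^{(l+1)}_{j} \;=\; \sum_{i} \varphi^{(l)}_{j,i}\!\left(x^{(l)}_{i}\right),
$$

where each $\varphi^{(l)}_{j,i}$ is its own trainable one-dimensional function, parameterized by B-spline coefficients.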

It turns out that the standard formula of neural networks has been rewritten, and the process of approximating the original function (the pattern) becomes more effective. Benchmarks have shown roughly a tenfold gain in efficiency, but we sacrifice training speed, which drops proportionally – by roughly the same factor of ten.

Of course, it is interesting to see how such an architecture is implemented in the popular frameworks Keras and TensorFlow. We have written a longer piece that examines the architecture in more detail; there you will also find a PyTorch implementation and a working example from the official KAN GitHub using the PyKAN library, built specifically for Kolmogorov-Arnold networks.

If you want to get straight to the point through examples, go to the KAN repository on GitHub.

Or go straight to the notebook with code and explanations, or read the documentation.
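
To give a feel for it, here is a minimal sketch in the spirit of the PyKAN README (treat it as an approximation: the exact method names – e.g. fit vs. train – and default arguments vary between library versions):

```python
import torch
from kan import KAN, create_dataset   # pip install pykan

# A 2-input, 1-output KAN with one hidden layer of 5 nodes;
# grid=5 spline intervals and k=3 (cubic B-splines) on every edge.
model = KAN(width=[2, 5, 1], grid=5, k=3)

# Synthetic target in the style of the repo examples: f(x1, x2) = exp(sin(pi*x1) + x2^2).
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

# Train the spline coefficients (called `fit` in recent versions, `train` in older ones).
model.fit(dataset, opt="LBFGS", steps=20)
```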

For mathematicians, the full version of the article can be read in English – here.
