Visualizing music with real-time neural networks… In a sense…

At the moment (a mandatory caveat when talking about neural networks), it takes AI about a second to generate one image, half a second at best. For video we need at least 15 frames per second, and a full thirty would be better. So it is impossible to generate video with a neural network in real time, and therefore impossible to visualize music with one.

Now that we understand this, let’s think about how to do it.

By the way, I started writing this article while waiting for the start of a concert where a video installation changed along with the music. Here’s how it went.

The easiest way to understand how such content is generated is to imagine a two-dimensional slice of a four-dimensional cube. That is how I planned to open this explanation, but then decided it was a bit much. So let’s do it this way:

To visualize music, it has to be represented as a set of parameters, for example the amplitude of the signal at certain frequencies. The parameters can be anything, and there can be any number of them. They change over time, and every moment in time needs its own image.

Making the image generated by a neural network depend on a parameter is easy: we already set plenty of parameters during generation. And some of those parameters have the property that a small change in the parameter changes the picture only slightly. Many Stable Diffusion scripts are built on this; the Shift Attention plugin, for example, smoothly morphs the image as the weight of a particular token in the prompt changes. These are exactly the parameters we need.

But, as we remember, the main difficulty is that generating these frames takes far longer than we can afford. The way out is obvious: generate the possible frames in advance, one for each possible combination of parameter values. Take, say, a drawing of a flower. One parameter will control its height, and the other its appearance, for instance how much it looks like an ink blot.

Changing the first parameter

Changing the second parameter
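
Here is a minimal sketch of that pre-generation step in Python. The generate_image function is only a placeholder for whatever Stable Diffusion setup you use (for example, a script that keeps the seed fixed while varying two token weights, in the spirit of Shift Attention); the 100-step grid and the file naming are my own assumptions.

```python
from pathlib import Path

STEPS = 100                       # values per parameter -> a 100x100 grid
OUT_DIR = Path("tesseract")
OUT_DIR.mkdir(exist_ok=True)

def generate_image(height_weight: float, blot_weight: float):
    """Placeholder: render one frame with a fixed seed while varying two
    prompt-token weights (flower height and 'blot-ness'). Plug your own
    Stable Diffusion pipeline in here."""
    raise NotImplementedError

for i in range(STEPS):            # first parameter: flower height
    for j in range(STEPS):        # second parameter: how blot-like it is
        w1 = i / (STEPS - 1)      # normalize indices to 0..1
        w2 = j / (STEPS - 1)
        img = generate_image(height_weight=w1, blot_weight=w2)
        # The file name encodes the grid position, so playback can
        # later find a frame by its two parameter indices.
        img.save(OUT_DIR / f"frame_{i:03d}_{j:03d}.png")
```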

So we get a set of images arranged along two axes: a rectangle in which every point is a picture. Since each picture is itself two-dimensional, the whole thing amounts to a four-dimensional cube, a tesseract. That tesseract is our visualization, potentially containing the frames for every combination of parameter values. While the music plays, the two parameters change, and at any given moment they select a specific picture. This is how the music pulls a sequence of frames out of the tesseract, and that sequence is the visualization. In real time, mind you.
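
The playback side is just a lookup. A minimal sketch of it (plain Python with Pillow; it assumes the grid and file naming from the sketch above, while in my setup the equivalent lookup happens inside TouchDesigner):

```python
from pathlib import Path
from PIL import Image

STEPS = 100
OUT_DIR = Path("tesseract")

def frame_for(p1: float, p2: float) -> Image.Image:
    """Map two normalized parameters (0..1) to a pre-rendered frame."""
    i = min(int(p1 * STEPS), STEPS - 1)
    j = min(int(p2 * STEPS), STEPS - 1)
    return Image.open(OUT_DIR / f"frame_{i:03d}_{j:03d}.png")

# While the music plays, p1 and p2 come from the audio analysis;
# each call is only a file lookup, so it easily keeps up with 30 fps.
```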

In this video, you can see how the parameters change, and how the image changes as a result. The parameters here are the amplitude at low frequencies and the amplitude at high frequencies.

As you can see, I implemented all of this in TouchDesigner.
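
For reference, this is roughly what that analysis looks like outside TouchDesigner (inside it, an Audio Spectrum CHOP does the same job). A NumPy sketch; the band cutoffs and the use of mean magnitude are my own assumptions:

```python
import numpy as np

def band_amplitudes(samples: np.ndarray, sample_rate: int,
                    low_cut: float = 200.0, high_cut: float = 2000.0):
    """Return (low, high) amplitudes for one chunk of mono audio.

    low  is the mean spectral magnitude below low_cut Hz,
    high is the mean spectral magnitude above high_cut Hz.
    Both are then normalized to 0..1 before indexing the tesseract.
    """
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    low = spectrum[freqs < low_cut].mean()
    high = spectrum[freqs > high_cut].mean()
    return low, high
```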

Of course, all of this has to be adapted to the specific tools you use.

So, to create a visualization, we build a tesseract containing pictures for every possible value of the parameters we take from the music. There can be as many parameters as you like, but keep in mind that 2 parameters with 100 values each is already 10,000 pictures, and 3 parameters is 1,000,000. Generating a 100×100 tesseract took me a day and a half.
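
A quick back-of-the-envelope check of those numbers (the per-image time is only an estimate: a day and a half for 10,000 frames works out to roughly 13 seconds per image):

```python
def tesseract_cost(n_params: int, values_per_param: int,
                   seconds_per_image: float = 13.0):
    """Frame count and total generation time (in hours) for a full grid."""
    frames = values_per_param ** n_params
    return frames, frames * seconds_per_image / 3600

print(tesseract_cost(2, 100))   # (10000, ~36 hours)
print(tesseract_cost(3, 100))   # (1000000, ~3600 hours) -- hopeless for now
```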

Tesseracts can be very different in content.

Creating them is a genuinely creative process in its own right. For now, like any neural network video, this kind of graphics lacks smoothness: it is very hard to make two adjacent frames with similar generation parameters as close to each other as possible. That is what I am working on right now.
