How to Bring Kandinsky to Life with Rotation Matrices to Generate Video (Part 1)
This article on the text2video problem is based on my recent master's thesis at MIPT and is my first publication.
Initially, the topic of my master's thesis was formulated as video generation from a text description with Stable Diffusion. I started working in early October 2023. By that time there were still few publications on video generation and little ready-made code for the text2video problem. I remember two projects from back then: Text2video-zero and AnimateDiff.
The Text2video-zero project already had an article from March 2023, in which the authors proposed adding a time axis to the structure of the U-Net diffusion model and teaching the model to generate a batch of frames at once, training it on batches of sequential frames from a video. Which is quite logical.
The Stable Diffusion website announced the AnimateDiff project, which describes the team's approaches to video generation: influencing the noise and plugging various LoRA adapters into the U-Net of ready-made text-to-image models. These adapters are trained, with the main U-Net layers frozen, to account for the changes between successive frames.
I wanted to find something of my own, and since my immersion in diffusion models was only just beginning at the time, I decided to start from the simplest approach.
What is the problem with generating video from text?
Unlike generating a single image, we need to obtain a series of maximally similar images containing small changes specified by the text itself.
As you can see from the slide, the solution is not so obvious.
In essence, the verb "washes" should drive related changes, while the words "mama" and "frame" should leave the pictures unchanged. But how do we achieve this?
Pixel and Clip Space Video
First I looked at pixel space. The slide shows both the frames and the changes between them; starting from the last frame, you can get back to the first one by subtracting the changes.
Then I became curious about what happens with the CLIP vectors of frames and with the pixel matrices flattened into vectors. More precisely, how similar these two types of vectors are between different frames. To find out, I simply built correlation matrices of the frames against each other.
Let's take a closer look at them.
The pixel space correlation matrix shows a wider range of correlation values, with a noticeable decrease in them compared to the CLIP space. This indicates a greater sensitivity of the pixel space to small changes between frames.
The correlation matrix of the CLIP space shows more stable and smooth transitions between adjacent frames. This indicates that CLIP embeddings abstract high-level information from images, while pixel space is more sensitive to low-level details.
But if you look only at the changes between adjacent frames, you can see the relationship between the two spaces: both capture the main changes in the video, but with different degrees of sensitivity to detail.
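As an illustration, such frame-correlation matrices take only a few lines to compute. In this sketch, synthetic drifting arrays stand in for real video frames, and a random projection stands in for the CLIP image encoder — both are assumptions for the example, not the setup from the experiment:

```python
import numpy as np

# Synthetic stand-ins for video frames: in the real experiment these would be
# decoded video frames, and the embeddings would come from a CLIP image encoder.
rng = np.random.default_rng(0)
base = rng.standard_normal((64, 64, 3))
frames = np.stack([base + 0.05 * i * rng.standard_normal(base.shape)
                   for i in range(8)])          # 8 slightly drifting frames

# Pixel space: flatten each frame into a vector, then correlate frames pairwise.
flat = frames.reshape(len(frames), -1)
pixel_corr = np.corrcoef(flat)                  # (8, 8), with 1.0 on the diagonal

# Any embedding space is handled the same way; here a random projection
# plays the role of the CLIP encoder (an assumption, not the real model).
proj = rng.standard_normal((flat.shape[1], 512))
emb = flat @ proj
emb_corr = np.corrcoef(emb)

print(pixel_corr.shape, emb_corr.shape)
```

With real frames, the pixel-space matrix shows the wider spread of correlations described above, while the embedding-space matrix stays smoother between neighbors.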
I suggested that it would be possible to form some kind of displacement tensor based on the correlation matrix of the CLIP image space, which could be used to act on the text vectors or on the starting noise itself.
The CLIP model was trained to bring vectors of the text space and the image space closer together, so one could expect something reasonable from such manipulations.
Next, I formulated the hypothesis that I laid down as the basis of my research: controlled changes to text embeddings can produce small changes in the generated images, forming a video sequence.
First experiments
At first I decided to experiment with modifying the noise, acting on it from generation to generation with a matrix close to the identity, while preserving the original noise by fixing the seed.
With certain matrix parameters, such small changes did begin to appear, which clearly hinted that the idea was not useless.
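A minimal sketch of this experiment, with a flattened random tensor standing in for the diffusion model's starting noise (the dimensions and the perturbation scale `eps` are my assumptions, not the values from the original runs):

```python
import numpy as np

rng = np.random.default_rng(42)                 # fixed seed: the original noise is preserved
latent = rng.standard_normal((4, 1024))         # stand-in for the latent noise tensor

# A matrix close to the identity: I plus a small random perturbation.
eps = 1e-3
dim = latent.shape[1]
perturb = np.eye(dim) + eps * rng.standard_normal((dim, dim))

# Each "generation" nudges the previous noise slightly.
noises = [latent]
for _ in range(3):
    noises.append(noises[-1] @ perturb)

# The noise drifts only slightly from generation to generation.
drift = np.linalg.norm(noises[-1] - noises[0]) / np.linalg.norm(noises[0])
print(f"relative drift after 3 steps: {drift:.4f}")
```

Feeding each `noises[i]` to the same generation call with the same prompt is what produces the small image-to-image changes mentioned above.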
How to move on to texts?
From the NLP course I remembered that in the Word2Vec embedding space there is also a geometric relationship between embedding vectors, including angular similarity.
As a result, the concept of rotating embeddings began to suggest itself, and rotation immediately led to Rodrigues' rotation formula for 3D space:

$$R = I + \sin\theta \, K + (1 - \cos\theta)\, K^2$$

where $K$ is the skew-symmetric cross-product matrix of the unit rotation axis and $\theta$ is the rotation angle.

This formula is widely used in 3D graphics, robotics, and many other fields, and the resulting matrix $R$ is called a rotation matrix. It would seem: what do 3D space and rotations of objects in it have to do with anything? I am always asked this question. Much of what I write below is a search for answers to that question, but at the time I was driven simply by the intuition that transformations in spaces of any dimension must obey certain general principles, with invariants and laws of their transformation.
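Rodrigues' formula is only a few lines of NumPy. The helper name `rodrigues` is mine, not from any library:

```python
import numpy as np

def rodrigues(axis: np.ndarray, theta: float) -> np.ndarray:
    """3D rotation matrix R = I + sin(theta)*K + (1 - cos(theta))*K^2."""
    k = axis / np.linalg.norm(axis)
    # Cross-product (skew-symmetric) matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Rotating the x-axis by 90 degrees around z gives the y-axis.
R = rodrigues(np.array([0.0, 0.0, 1.0]), np.pi / 2)
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))   # → [0. 1. 0.]
```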
Rotation matrices of multidimensional space
To move to N-dimensional space, one needs to dive into group theory, where the exponential mapping from a Lie algebra into a Lie group and the use of rotation generators become available (each planar rotation sweeps out a one-parameter, Abelian subgroup). A multidimensional rotation decomposes into a product of two-dimensional rotations, each of which affects only the two dimensions in which it acts, leaving the others unchanged.
When working with generators A in the form of skew-symmetric matrices, the exponential function is the basis for a smooth transition from algebra to geometry, in particular from infinitesimal transformations to finite ones. This is important in physics and engineering, where abrupt changes can lead to undesirable behavior such as mechanical failures or unrealistic animation in graphics.
$$A = n_2 n_1^{T} - n_1 n_2^{T}$$

where $n_1$ and $n_2$ are n-dimensional orthonormal unit vectors spanning the plane of rotation.
Expanding the exponential of the generator in a Taylor series and regrouping the terms leads to the formula for an N-dimensional rotation. The slide shows the basic formula for the rotation matrix between two vectors in multidimensional space,

$$R = I + \sin\theta\,(n_2 n_1^{T} - n_1 n_2^{T}) + (\cos\theta - 1)\,(n_1 n_1^{T} + n_2 n_2^{T})$$

consisting of 3 terms,
in order:
– identity matrix – this term ensures that the components of a vector orthogonal to the plane of rotation are not affected by the rotation;
– skew-symmetric term – this term creates the rotation in the plane spanned by the vectors n_1 and n_2 and is responsible for the actual rotation effect;
– symmetric term – this term corrects the components lying in the plane of rotation, completing their contribution to the total rotation.
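Putting the three terms together gives a compact N-dimensional rotation routine. This is a sketch: the function name and the CLIP-sized dimension are my choices for illustration:

```python
import numpy as np

def ndim_rotation(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Rotation matrix turning the direction of v1 onto v2 in the plane they span.

    R = I + sin(t)*(n2 n1^T - n1 n2^T) + (cos(t) - 1)*(n1 n1^T + n2 n2^T)
    """
    n1 = v1 / np.linalg.norm(v1)
    # Gram-Schmidt: the component of v2 orthogonal to n1, inside the plane.
    w = v2 - (v2 @ n1) * n1
    n2 = w / np.linalg.norm(w)
    cos_t = np.clip((v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)), -1.0, 1.0)
    theta = np.arccos(cos_t)
    dim = len(v1)
    return (np.eye(dim)
            + np.sin(theta) * (np.outer(n2, n1) - np.outer(n1, n2))
            + (np.cos(theta) - 1) * (np.outer(n1, n1) + np.outer(n2, n2)))

# In 512 dimensions (CLIP-sized, purely as an illustration), R maps the
# direction of v1 exactly onto the direction of v2.
rng = np.random.default_rng(0)
v1, v2 = rng.standard_normal(512), rng.standard_normal(512)
R = ndim_rotation(v1, v2)
rotated = R @ (v1 / np.linalg.norm(v1))
print(np.allclose(rotated, v2 / np.linalg.norm(v2)))   # → True
```

Note that R is orthogonal, so it preserves vector norms — only directions change, which is exactly the property needed for gentle embedding edits.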
Experiments with rotation matrices
The discovery of the formula for N-dimensional rotations laid the foundation for the first experiments with rotating text embeddings.
The slide schematically shows how changes in the text vectors used for generation in the diffusion model are driven by the rotation matrix obtained from the vectors of the corresponding frames. As can be seen from the slide, the (i+1)-th vector is obtained by adding to the i-th vector a small increment: the product of the rotation matrix and that same i-th vector, with the smallness of the contribution controlled by the coefficient g. This is similar to perturbation theory.
The first experiments were conducted on the Stable Diffusion 1.4 and 1.5 models. A video clip was taken, a rotation matrix was calculated from the CLIP vectors of adjacent frames, and that matrix was then applied to the text embedding before generating each image. The slide shows successful examples.
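A schematic version of this update rule, with random vectors standing in for the CLIP embeddings of adjacent frames and for the prompt embedding (the coefficient value, dimensions, and all names are my assumptions):

```python
import numpy as np

def rotation_between(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """N-dimensional rotation turning the direction of v1 toward v2."""
    n1 = v1 / np.linalg.norm(v1)
    w = v2 - (v2 @ n1) * n1                 # Gram-Schmidt step
    n2 = w / np.linalg.norm(w)
    c = np.clip((v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)), -1.0, 1.0)
    s = np.sqrt(1.0 - c * c)
    d = len(v1)
    return (np.eye(d) + s * (np.outer(n2, n1) - np.outer(n1, n2))
            + (c - 1) * (np.outer(n1, n1) + np.outer(n2, n2)))

rng = np.random.default_rng(1)
frame_a = rng.standard_normal(768)                    # stand-ins for the CLIP
frame_b = frame_a + 0.1 * rng.standard_normal(768)    # vectors of adjacent frames

R = rotation_between(frame_a, frame_b)

g = 0.05                                    # small coefficient (assumed value)
text_emb = rng.standard_normal(768)         # stand-in for the prompt embedding
sequence = [text_emb]
for _ in range(4):                          # v_{i+1} = v_i + g * R @ v_i
    v = sequence[-1]
    sequence.append(v + g * (R @ v))

print(f"{len(sequence)} embeddings prepared for generation")
```

Each element of `sequence` would then be fed to the diffusion model in place of the original text embedding, one per frame.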
Then I became interested in the Kandinsky 2.2 model, which is built on the unCLIP approach: it has a diffusion model that effectively works as image-to-image, while the text information is concentrated in the Prior model, which learns to produce, from projections of text vectors, vectors maximally close to image vectors. DALL-E 2 has a similar structure.
In Kandinsky 2.2, the embeddings after the Prior model have a greater resemblance to the image embeddings, which should in theory work better with the multidimensional rotation approach.
The slides below show some examples of Kandinsky 2.2 Decoder generations from modified Prior model embeddings with rotation matrices obtained from changes in third-party video footage.
Experiments using rotation matrices and the Kandinsky 2.2 model were conducted on text embeddings, on noise, and on both in combination.
More examples of video sequence generation.
As a result, experiments with rotation matrices demonstrated the possibility of transmitting information about changes between latent spaces of different modalities.
Research has shown that:
control of generation via rotation matrices is possible;
control and management of changes can become the basis for developing machine learning methods.
The results excited me, as they showed the direction of my further search. I will tell about the results in the next part.
You can listen to my report on the material above on my channel, which has only just begun to fill up. The research at this stage is presented in my repositories.
I would be glad to receive questions and comments both on the topic and on the style of presentation, as I am just learning to write articles. To be continued.