How to Bring Kandinsky to Life with Rotation Matrices to Generate Video (Part 1)

This publication on the text2video problem is based on my recent master's thesis at MIPT and is my first article.

Initially, the topic of my master's thesis was formulated as video generation from a text description using Stable Diffusion. I started working in early October 2023. By that time there were still few publications on video generation and little ready-made code for the text2video problem. From that period I remember two projects: Text2video-zero and AnimateDiff.

Text2video-zero

The Text2video-zero project already had a paper from March 2023, in which the authors proposed adding a time axis to the structure of the U-Net diffusion model and teaching the model to generate a batch of frames at once, training it on batches of sequential frames from videos, which is quite logical.

AnimateDiff

The Stable Diffusion website announced the AnimateDiff project, which describes the approaches the team tested for generating video: influencing the noise, and inserting various LoRA adapters for the U-Net into ready-made text-to-image models. These adapters are trained, with the main U-Net layers frozen, to account for the changes between successive frames.

I was interested in finding something of my own, and since my immersion in diffusion models was still at the very beginning at that time, I decided to start from the simplest things.

What is the problem with generating video from text?

Unlike generating a single image, we need to obtain a series of maximally similar images containing small changes specified by the text itself.

As you can see from the slide, the solution is not so obvious.

In essence, the verb "washes" should drive coherent changes, while the words "mama" and "frame" should leave the pictures unchanged. But how can this be done?

Video in Pixel and CLIP Space

First I looked at the pixel space. On the slide you can see both the frames and the differences between them; from the last frame you can get back to the first by subtracting those differences.

Then I became curious about what happens with the CLIP vectors of the frames and with the pixel matrices flattened into vectors. More precisely, how similar these two kinds of vectors are across different frames. To check this, I simply built correlation matrices of the frames against each other.

correlation of frames with each other
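For reference, here is a minimal sketch of how such correlation matrices can be built, assuming the frames have already been extracted from the clip; the CLIP checkpoint name and the frame paths are illustrative:

```python
import glob

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative: frames extracted from the clip beforehand
frames = [Image.open(p).convert("RGB") for p in sorted(glob.glob("frames/*.png"))]

# Pixel space: flatten each frame matrix into a single vector
pixel_vecs = np.stack([np.asarray(f, dtype=np.float32).ravel() for f in frames])
pixel_corr = np.corrcoef(pixel_vecs)  # (n_frames, n_frames)

# CLIP space: embed each frame with the image encoder
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    inputs = processor(images=frames, return_tensors="pt")
    clip_vecs = model.get_image_features(**inputs).numpy()
clip_corr = np.corrcoef(clip_vecs)  # (n_frames, n_frames)
```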

Let's take a closer look at them.

The pixel-space correlation matrix shows a wider range of correlation values, which drop off noticeably compared to the CLIP space. This indicates the pixel space's greater sensitivity to small changes between frames.

The correlation matrix of the CLIP space shows more stable and smooth transitions between adjacent frames. This indicates that CLIP embeddings abstract high-level information from images, while pixel space is more sensitive to low-level details.

But if you look only at the changes between adjacent frames, you can see a relationship between the two spaces: both capture the main changes in the video, but with different degrees of sensitivity to detail.

I conjectured that, based on the correlation matrix of the CLIP image space, one could form some kind of displacement tensor that could be used to act on the text vectors or on the starting noise itself.

Formation of the displacement tensor

The CLIP model was trained to bring the vectors of the text space and the image space closer together, so one could expect something reasonable from such manipulations.

Next, I formulated the hypothesis that became the basis of my research: controlled changes to text embeddings can produce small changes in the generated images, forming a video sequence.

First experiments

At first I decided to experiment with modifying the noise, acting on it from generation to generation with a matrix close to the identity, while preserving the original noise by fixing the seed.
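A conditional sketch of this kind of experiment, assuming the SD 1.x latent shape; the perturbation scale eps and the choice to act along the last latent axis are illustrative, not the exact construction from the slide:

```python
import torch

# Fixed seed preserves the original starting noise
generator = torch.Generator().manual_seed(42)
latents0 = torch.randn(1, 4, 64, 64, generator=generator)  # SD 1.x latent shape

# Matrix close to the identity: I plus a small random perturbation
eps = 1e-3  # illustrative scale
n = latents0.shape[-1]
M = torch.eye(n) + eps * torch.randn(n, n, generator=generator)

# Apply the matrix cumulatively from generation to generation,
# always keeping the original latents0 intact
noise_per_frame = []
latents = latents0.clone()
for _ in range(8):
    latents = latents @ M  # acts along the last (width) axis
    noise_per_frame.append(latents)
```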

conditional example of the formation of the noise displacement tensor

With certain matrix parameters, small changes of exactly this kind began to appear, which clearly hinted that the idea was not useless.

Noise modification in SD 1.4

How to move on to texts?

From the NLP course I remembered that the Word2Vec embedding space also has geometric relationships between embedding vectors, including angular similarity.

As a result, the concept of rotating embeddings began to suggest itself. And rotation immediately led to the Rodrigues rotation formula for 3D space.

R = I + sin(θ)·K + (1 − cos(θ))·K²

where K is the skew-symmetric cross-product matrix built from the unit rotation axis.

Rodrigues' rotation formula
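A quick NumPy check of the formula (the axis and angle here are arbitrary examples):

```python
import numpy as np

def rodrigues(axis, theta):
    """3D rotation matrix about a unit axis, via Rodrigues' formula."""
    k = axis / np.linalg.norm(axis)
    # K: skew-symmetric cross-product matrix of k
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Rotating the x axis by 90 degrees about z gives the y axis
R = rodrigues(np.array([0.0, 0.0, 1.0]), np.pi / 2)
print(np.allclose(R @ np.array([1.0, 0.0, 0.0]), [0.0, 1.0, 0.0]))  # True
```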

It is widely used in 3D graphics, robotics, and many other fields, and the matrix R itself is called the rotation matrix. It would seem: what do 3D space and rotations of objects in it have to do with anything? I am always asked this question. Much of what I write below is already a search for answers to it, but at the time I was driven simply by the intuition that transformations in spaces of any dimension must obey certain general principles, and that there must be invariants and laws governing them.

Rotation matrices in multidimensional space

To move to N-dimensional space, one needs to dive into group theory, where, via one-parameter (Abelian) subgroups, the exponential map from a Lie algebra into a Lie group and the use of rotation generators become available. A multidimensional rotation decomposes into a product of two-dimensional rotations; each two-dimensional rotation affects only the two dimensions in which it acts, leaving the others unchanged.

Screenshots of the introductory course of lectures on Group Theory

When working with generators A in the form of skew-symmetric matrices, the use of the exponential function is the basis for a smooth transition from algebra to geometry, in particular, from infinitesimal transformations to finite increments. This is important in physics and engineering, where abrupt changes can lead to undesirable behavior, such as mechanical failures or unrealistic animation in graphics.

A = n_1 n_2^T − n_2 n_1^T

where n_1 and n_2 are n-dimensional orthogonal unit vectors.
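A small sketch of this construction: build the generator from two orthonormal vectors and exponentiate it, with scipy's expm playing the role of the exponential map (the dimension here is arbitrary):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 8  # arbitrary dimension for the example

# Two orthonormal vectors spanning the rotation plane (Gram-Schmidt)
n1 = rng.normal(size=d)
n1 /= np.linalg.norm(n1)
v = rng.normal(size=d)
n2 = v - (v @ n1) * n1
n2 /= np.linalg.norm(n2)

# Skew-symmetric generator and the exponential map into the group
A = np.outer(n1, n2) - np.outer(n2, n1)
theta = 0.1
R = expm(theta * A)

print(np.allclose(R @ R.T, np.eye(d)))  # True: R is orthogonal, a proper rotation
```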

Expanding the matrix exponential in a Taylor series and rearranging the terms leads to the formula for the matrix of N-dimensional rotations. The slide shows the basic formula for the rotation matrix between two vectors in multidimensional space, consisting of three terms:

R = I + sin(θ)(n_1 n_2^T − n_2 n_1^T) + (cos(θ) − 1)(n_1 n_1^T + n_2 n_2^T)

Rotation in high dimensions

in order:

  • identity matrix – ensures that the components of a vector orthogonal to the plane of rotation (aligned with the rotation axis) are not affected by the rotation;

  • skew-symmetric term – crucial for creating the rotation in the plane spanned by the vectors n_1 and n_2; this term is responsible for the actual rotation effect;

  • symmetric term – corrects the components lying in the plane of rotation and their contribution to the total rotation.
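Here is a minimal closed-form implementation of these three terms: a rotation matrix that takes the direction of one vector to the direction of another (the sign of the skew term is chosen so that R @ u lands on v):

```python
import numpy as np

def rotation_between(u, v):
    """N-dimensional rotation matrix taking the direction of u to the direction of v."""
    n1 = u / np.linalg.norm(u)
    w = v - (v @ n1) * n1                     # component of v orthogonal to u
    n2 = w / np.linalg.norm(w)
    cos_t = np.clip((v @ n1) / np.linalg.norm(v), -1.0, 1.0)
    theta = np.arccos(cos_t)
    A = np.outer(n2, n1) - np.outer(n1, n2)   # skew-symmetric term
    P = np.outer(n1, n1) + np.outer(n2, n2)   # symmetric term: projector onto the plane
    return np.eye(len(u)) + np.sin(theta) * A + (np.cos(theta) - 1.0) * P

u = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
print(np.allclose(rotation_between(u, v) @ u, v))  # True
```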

Experiments with rotation matrices

The discovery of the formula for N-dimensional rotations laid the foundation for the first experiments with rotating text embeddings.

applying rotation matrices to embeddings

The slide schematically shows how changes are transferred to the text vectors used for generation in the diffusion model via the rotation matrix obtained from the vectors of the corresponding frames. As can be seen from the slide, the (i+1)-th vector is obtained by adding to the i-th vector a small correction: the matrix product of the rotation matrix with that same i-th vector, where the smallness of the contribution is set by the coefficient g. This is similar to perturbation theory.
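In code, the update rule looks like this; rotation_between() is the helper from the sketch above, and the frame embeddings, the text embedding, and the coefficient g are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
clip_frame_i, clip_frame_i1 = rng.normal(size=(2, 768))  # stand-ins for CLIP frame vectors
text_emb = rng.normal(size=768)                          # stand-in for the text embedding

g = 0.05  # smallness coefficient (illustrative value)
R = rotation_between(clip_frame_i, clip_frame_i1)
text_emb_next = text_emb + g * (R @ text_emb)  # e_{i+1} = e_i + g * R e_i
```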

The first experiments were conducted on the Stable Diffusion 1.4 and 1.5 models. A video clip was taken, a rotation matrix was computed from the CLIP vectors of adjacent frames, and that matrix was then applied to the text embedding before generating each image. The slide shows successful examples.
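A rough sketch of this loop with the diffusers library (API names as in recent diffusers versions); the rotation matrix here is a placeholder for one computed by rotation_between() from CLIP vectors of adjacent reference frames, and g and the frame count are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Encode the prompt once; shape (1, 77, 768) for SD 1.x
prompt_embeds, _ = pipe.encode_prompt(
    "mama washes the frame", device="cuda",
    num_images_per_prompt=1, do_classifier_free_guidance=False,
)

g, n_frames = 0.05, 8  # illustrative values
d = prompt_embeds.shape[-1]
# Placeholder: the real R comes from rotation_between() on CLIP frame vectors
R = torch.eye(d, device="cuda", dtype=prompt_embeds.dtype)

video = []
for _ in range(n_frames):
    gen = torch.Generator("cuda").manual_seed(42)  # same starting noise every frame
    video.append(pipe(prompt_embeds=prompt_embeds, generator=gen).images[0])
    # Rotate each token embedding slightly before the next generation
    prompt_embeds = prompt_embeds + g * (prompt_embeds @ R.T)
```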

experiments with SD 1.4(5)

Then I became interested in the Kandinsky 2.2 model, which is built on the unCLIP approach: its diffusion model effectively works image-to-image, while the text information is concentrated in the Prior model, which learns to produce, from projections of text vectors, vectors maximally close to image vectors. The DALL-E model has a similar structure.

In Kandinsky 2.2, the embeddings after the Prior model bear a greater resemblance to image embeddings, which in theory should work better with the multidimensional rotation approach.
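A sketch of the same idea for Kandinsky 2.2 with diffusers pipelines: the Prior maps the text into the CLIP image space, its embedding is rotated between frames, and the Decoder generates an image from it. As before, R and g are placeholders, not the exact experimental setup:

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

out = prior("mama washes the frame")  # text -> CLIP-image-like embedding
image_embeds, negative_embeds = out.image_embeds, out.negative_image_embeds

g, n_frames = 0.05, 8  # illustrative values
d = image_embeds.shape[-1]
R = torch.eye(d, device="cuda", dtype=image_embeds.dtype)  # placeholder rotation

video = []
for _ in range(n_frames):
    gen = torch.Generator("cuda").manual_seed(42)
    video.append(decoder(
        image_embeds=image_embeds, negative_image_embeds=negative_embeds,
        height=512, width=512, generator=gen,
    ).images[0])
    # Small rotation of the prior embedding before the next frame
    image_embeds = image_embeds + g * (image_embeds @ R.T)
```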

Kandinsky 2.2 generation together with rotation matrix

The slides below show some examples of Kandinsky 2.2 Decoder generations from modified Prior model embeddings with rotation matrices obtained from changes in third-party video footage.

influence of third-party video sequence on embeddings for image generation

Experiments with rotation matrices and the Kandinsky 2.2 model were conducted on the text embeddings, on the noise, and on both in combination.

on the right is the frame-by-frame noise modification based on the previous generation

More examples of video sequence generation.

As a result, the experiments with rotation matrices demonstrated that information about changes can be transferred between the latent spaces of different modalities.

Research has shown that:

  • controlling generations via rotation matrices is possible;

  • controlling and managing changes can become the basis for developing machine learning methods.

The results excited me, as they showed the direction of my further search. I will describe those results in the next part.

You can listen to my talk on the material above on my channel, which has only just begun to fill up. The research at this stage is presented in my repositories.

I would be glad to receive questions and comments both on the topic and on the style of presentation, as I am just learning to write articles. To be continued.
