How to Animate Kandinsky with Rotation Matrices for Video Generation – The Splitter Model (Part 2)

In the first part, I introduced an approach to video generation based on rotation matrices. Intuition led me to it; I then worked on formalizing the idea after an initial immersion in group theory, and was ready to move on to solving the problem with machine learning.

Hypothesis

The methodology is based on the approach used in video codecs:

  • key frame

  • an algorithm that describes changes from a reference frame to a certain depth.

The widely used I-Frame approach, which uses a reference frame and subsequent change frames, suggests that we need to learn how to pass information about changes in the latent space of images to a model that operates on the boundary of two latent spaces!

The slide below schematically shows two latent spaces of different modalities.

latent spaces of different modalities

In addition to the main vectors, there are change vectors. Since the previous experiments were built around changes, a hypothesis arose.

Hypothesis for the text2video task

It is possible to build a trainable model that predicts the vector for generating the i-th frame while passing the model information only about the changes relative to the 0-th frame via the loss function.

Loss function

Previous experiments have shown that it makes sense to build neural network training on changes. The changes themselves can be of different kinds: from the point of view of the vector space, these are angular changes of a vector as well as changes in its length. Changes in the pixels of the frames are reflected, through the encoder transformation, in the image embeddings. Therefore, to train a conditional model on these small changes in frame embeddings, a combination of two error functions was taken as the main option. Based on preliminary experiments and analysis of vector changes in the latent space, it was decided to use a combined approach with two different loss metrics: CosineEmbeddingLoss and MSELoss.

The slide below shows a loss function that relates changes in spaces through angular and metric data.

Combined loss function

It looks like a simple formula, but with a peculiarity: the output of the loss function used to compute the gradients should not be a number but a vector of the latent-space dimension. In effect, during training we do not work with the average gradient at a point but with the gradient field, and we learn from changes in different spaces, that is, from changes in the changes of the field.

There is also a similarity with the motion compensation approach in video codecs: there the picture is divided into blocks, and the changes in the blocks during transitions between frames are searched for and encoded as change vectors.

"Video compensation"

The model that will be applied to the latent space of texts will learn from somewhat similar change vectors, but taken in the latent space of images.
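To make this concrete, here is a minimal PyTorch sketch of such a combined loss. The weights and the exact way the two terms are combined are my assumptions; only the idea of an angular (CosineEmbeddingLoss-style) term plus an MSE term, returned without reduction to a scalar, follows the description above.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, alpha=1.0, beta=1.0):
    """Combined angular + metric loss, returned without reduction to a scalar."""
    # Elementwise squared error, shape (batch, latent_dim)
    mse_term = F.mse_loss(pred, target, reduction="none")
    # Cosine distance per sample, shape (batch, 1), broadcast across dimensions
    cos_term = (1.0 - F.cosine_similarity(pred, target, dim=-1)).unsqueeze(-1)
    # The result stays a tensor of the latent-space dimension, not a single number
    return alpha * cos_term + beta * mse_term

# backward() on a non-scalar loss needs an explicit gradient seed:
#   loss = combined_loss(pred, target)
#   loss.backward(torch.ones_like(loss))
```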

Splitter – what kind of beast is it?

To implement the training algorithm, I decided to use the Kandinsky 2.2 model, which I had already used for the first tests (see Part 1). Kandinsky 2.2 is built on an approach similar to unclip, which is used in some versions of Stable Diffusion and in DALL-E 2. More precisely, Kandinsky 2.2 uses the diffusion_mapping approach to transform high-dimensional text embeddings into latent embeddings while preserving geometric properties and connectivity. The output of this process is an embedding of the same size as the embedding from the image encoder.

Kandinsky 2.2

Generation in the Kandinsky 2.2 diffusion model is built on the Image-2-Image principle. Its second part, the Decoder, contains the UNet Image-2-Image diffusion model and the MOVQ model for upsampling the image. The first part contains the Prior model, which is trained to bring unclip text embeddings and image embeddings close together: they end up with high cosine proximity, and the UNet in the Decoder learns to restore images from image embeddings starting from noise. This functionality of Kandinsky 2.2, in tasks that require continuity and dynamics of scene changes, can provide a deeper understanding of how text is transformed into slightly changing frames, and it is extremely convenient for experimenting with my approach.

The proximity of unclip text embeddings and image embeddings also makes it possible to combine images and texts to create new images with Kandinsky 2.2. The structure of the model, the ability to train and use its modules separately, and the quality of image generation make it a very convenient test bed for my experiments in adding my own module, which I called Splitter, to obtain from Kandinsky 2.2 a composite model, Kandinsky 2.2 + Splitter, which is already capable of creating a video sequence.

Kandinsky 2.2 + Splitter

To adapt it to the Text2Video task, the Splitter model is placed between the Prior and the Decoder and modifies the vectors that the Decoder uses to generate images. It is important to note that Splitter is trained entirely in the latent space; the Decoder is not needed during training.

Splitter takes as input:

  • the ordinal number of the predicted vector,

  • full text embeddings from the CLIP-ViT-G model used in Kandinsky,

  • and the starting embedding from the Prior model.

Splitter Basic Training Scenario

If the proposed scenario works, the first results should appear even with a simple model.

Splitter has a simple configuration: input embedding layers followed by a cascade of down-sampling linear layers with regularization and nonlinearity layers. The output is a predicted, modified embedding that can later be passed to the Kandinsky 2.2 decoder to generate an image.
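A minimal sketch of what such a module might look like. The layer sizes, dropout rate, activation, and embedding dimensions (1280-dimensional image embeddings, 77×1280 text hidden states) are my assumptions; only the overall structure follows the description.

```python
import torch
import torch.nn as nn

class Splitter(nn.Module):
    """Sketch of a simple Splitter: frame-number embedding, projections of the text
    and Prior embeddings, then a down-sampling cascade with regularization."""

    def __init__(self, img_dim=1280, txt_tokens=77, txt_dim=1280,
                 hidden=2048, max_frames=128):
        super().__init__()
        self.frame_emb = nn.Embedding(max_frames, img_dim)       # ordinal number of the predicted vector
        self.txt_proj = nn.Linear(txt_tokens * txt_dim, hidden)  # full CLIP-ViT-G text embeddings, flattened
        self.img_proj = nn.Linear(img_dim, hidden)                # starting embedding from the Prior
        self.net = nn.Sequential(                                 # down-sampling cascade
            nn.Linear(hidden * 2 + img_dim, hidden),
            nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(hidden, hidden // 2),
            nn.LayerNorm(hidden // 2), nn.GELU(),
            nn.Linear(hidden // 2, img_dim),                      # predicted modified image embedding
        )

    def forward(self, frame_idx, text_emb, prior_emb):
        x = torch.cat([
            self.txt_proj(text_emb.flatten(1)),
            self.img_proj(prior_emb),
            self.frame_emb(frame_idx),
        ], dim=-1)
        return self.net(x)
```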

The training loop is written from scratch and will be discussed below. Everything is trained on a T4 card, which matters for the volume of experiments that had to be carried out.

Dataset

To create the data required for training, a simple and convenient dataset was used: TGIF.

Pros: convenience, variety, a large number of videos, and video lengths within 100 frames.

Cons: mostly low resolution, some static videos, and short text descriptions.

To filter out these problems, the data set for training the Splitter model was prepared by a separate script. Its output is a PyTorch dataset of already vectorized data. Due to resource limitations, 200 videos filtered and vectorized by the script were automatically selected for the initial test training.
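The exact layout of the vectorized records is not described in detail, so here is a hypothetical sketch of what such a pre-vectorized PyTorch dataset could look like; the field names are illustrative.

```python
from torch.utils.data import Dataset

class VectorizedVideoDataset(Dataset):
    """Hypothetical layout of the pre-vectorized data: one record per video with image
    embeddings for every frame plus the embeddings of its text description."""

    def __init__(self, records):
        self.records = records               # list of dicts produced by the preprocessing script

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        r = self.records[i]
        return {
            "frame_embs": r["frame_embs"],   # (num_frames, 1280) image-encoder embeddings
            "text_emb": r["text_emb"],       # full CLIP text embeddings of the caption
            "prior_emb": r["prior_emb"],     # unclip embedding from the Prior model
        }
```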

Initial training and testing

  • Training is carried out by randomly pulling a batch of frames out of a video if the video is longer than the batch size, or by taking all frames if the video is shorter than the batch size (see the sketch below).
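A minimal sketch of this sampling step, under the assumption that a video is stored as a tensor of per-frame embeddings:

```python
import torch

def sample_frame_batch(frame_embs: torch.Tensor, batch_size: int):
    """Take a random batch of frames from a video, or all frames
    if the video is shorter than the batch size."""
    num_frames = frame_embs.shape[0]
    if num_frames <= batch_size:
        idx = torch.arange(num_frames)
    else:
        idx = torch.sort(torch.randperm(num_frames)[:batch_size]).values
    return idx, frame_embs[idx]              # frame numbers are kept for the model input
```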

Inference scheme of the model from the number of the generated frame
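Roughly, this scheme could translate into the generation loop below. The two pipelines are the standard Kandinsky 2.2 prior and decoder from diffusers; the `splitter` instance, the zero-filled `text_emb` placeholder, the prompt, and the frame count are assumptions standing in for the trained model and the real preprocessing.

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")

image_embeds, negative_embeds = prior("a couple dancing").to_tuple()

splitter = Splitter().to("cuda", torch.float16)        # trained weights would be loaded here
text_emb = torch.zeros(1, 77, 1280, device="cuda",     # placeholder for the full CLIP text
                       dtype=torch.float16)            # embeddings prepared as in training

frames = []
for i in range(16):                                    # number of frames to generate
    emb_i = splitter(torch.tensor([i], device="cuda"), text_emb, image_embeds)
    img = decoder(image_embeds=emb_i, negative_image_embeds=negative_embeds,
                  height=512, width=512, num_inference_steps=50).images[0]
    frames.append(img)
```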

Examples of the first generations with the simple Splitter model

The first results of this simple approach encouraged further research. This is especially noticeable in the dancing couple, where, despite the instability of the background and clothing, a complex connection in their joint movements is still preserved.

Search for improvements

It seems like there is a lot of data: each video has from 15 to 100 frames, but the text description is the same for all frames.

If you play the video backwards, the same description will often still fit. So we train the model in both directions, from the initial frame and from the final one, giving the model a direction label at the input.
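A small sketch of this augmentation, assuming a video is stored as a tensor of per-frame embeddings; the dictionary keys are illustrative.

```python
import torch

def make_training_views(frame_embs: torch.Tensor):
    """Each video gives a forward and a backward view, tagged with a direction label
    that is fed to the model as an extra input."""
    views = []
    for direction in (0, 1):                       # 0 = forward, 1 = reversed
        seq = frame_embs if direction == 0 else frame_embs.flip(0)
        views.append({
            "start_emb": seq[0],                   # reference (0-th) frame embedding
            "targets": seq[1:],                    # frames the model should learn to predict
            "direction": torch.tensor(direction),  # direction label for the model input
        })
    return views
```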

adding a direction label to the dataset and model

Another hypothesis is SCISSORS.

New opening frame

You can cut frames from the beginning or the end of the video and start from a new frame, and the description will often still work.

But for correct operation and data enrichment, we must change the starting text vectors through the rotation matrices obtained from the image vectors of the original starting frame and the new starting frame.

Using Rotation Matrices for Augmentation

This is how rotation matrices appear in training. To use them in the training process, it was necessary to implement their application on tensors so that all calculations run on the GPU. In the first part of the article, where there was no training, rotation matrices were implemented in numpy.

Calculating rotation matrices

We apply the rotation matrix obtained from the old and new reference frames to the original unclip vector to obtain the modified unclip vector. We do the same to modify the full text vectors, using the rotation matrix obtained from the old and new unclip vectors.
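A torch sketch of this step, written so that everything stays on the GPU. The construction assumes the rotation acts in the plane spanned by the two embeddings and is the identity on the orthogonal complement (as in Part 1); the variable names in the usage comments are illustrative.

```python
import torch

def rotation_matrix(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Matrix that rotates unit vector a onto unit vector b inside the plane they span."""
    a = a / (a.norm() + eps)
    b = b / (b.norm() + eps)
    cos_t = torch.dot(a, b).clamp(-1.0, 1.0)
    v = b - cos_t * a                              # component of b orthogonal to a
    v = v / (v.norm() + eps)
    sin_t = torch.sqrt(torch.clamp(1.0 - cos_t ** 2, min=0.0))
    eye = torch.eye(a.shape[0], device=a.device, dtype=a.dtype)
    # Identity outside the plane span(a, v); a 2D rotation inside it
    return (eye - torch.outer(a, a) - torch.outer(v, v)
            + cos_t * (torch.outer(a, a) + torch.outer(v, v))
            + sin_t * (torch.outer(v, a) - torch.outer(a, v)))

# Usage (illustrative names):
#   R_img = rotation_matrix(old_start_frame_emb, new_start_frame_emb)
#   new_unclip = R_img @ old_unclip                # modified unclip vector
#   R_txt = rotation_matrix(old_unclip, new_unclip)
#   new_text_emb = old_text_emb @ R_txt.T          # full text token embeddings rotated
```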

Since training occurs on random frames from the video, it is necessary to account for the number of the “new zero” frame on the fly and select only those frames of the random sample that come after it in the video, since the frames before the “new zero” frame will be used when training in the reversed direction. The class that tracks the order of frames, their shuffling, and each frame's position in the batch after shuffling was extended accordingly. Tracking the position and the distance from the reference (or pseudo-reference) frame to each frame in the random batch is a very important point: if it is done incorrectly, instead of learning the features of the changes, the model averages everything out and “freezes”.
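A small sketch of that bookkeeping, with illustrative names:

```python
import torch

def frames_after_new_zero(num_frames: int, batch_size: int, new_zero_idx: int):
    """Sample a random batch of frame indices and keep only those that come after the
    "new zero" frame, together with their distance (step number) from it."""
    perm = torch.randperm(num_frames)[:batch_size]
    keep = perm[perm > new_zero_idx]          # frames before it are left for the reversed pass
    distances = keep - new_zero_idx           # fed to the model as the frame number
    return keep, distances
```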

Example after the initial training additions

Markov chain

The output of the Splitter model is a vector of the same nature as its input. This led to the idea of an additional regression step during training: making a prediction from a prediction by feeding the predicted embedding back into the Splitter model.

Target:

autoregressive step

Rotation matrices are also used here to transfer changes in the text space from changes in the original starting frame to the predicted starting frame.
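A sketch of such a step, reusing the rotation_matrix and combined_loss sketches from above; the function signature and the exact way the rotation is reused are my assumptions based on the description.

```python
def autoregressive_step(splitter, text_emb, start_emb, target_emb, idx1, idx2):
    """Prediction from a prediction: the first predicted embedding becomes the
    starting embedding for the second prediction."""
    pred_1 = splitter(idx1, text_emb, start_emb)
    # Transfer the text-space change via the rotation from the original starting
    # embedding to the predicted one
    R = rotation_matrix(start_emb[0], pred_1[0].detach())
    pred_2 = splitter(idx2, text_emb @ R.T, pred_1)
    return combined_loss(pred_2, target_emb)
```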

autoregressive model capabilities

The slide above shows examples of generations from model vectors obtained autoregressively, depending on how the model was trained.

It is clear that the capabilities of the model differ depending on whether there was a matrix rotation step and a regression step in its training.

Training trainer

Schematic representation of a custom training trainer that combines:

  • work with a random small sample of videos at each epoch, for more stable and generalized training of the model;

  • a random batch of frames from each video, for robustness to frame inconsistencies;

  • a normal training step in both directions;

  • a scissors training step in both directions;

  • an autoregressive step in both directions;

  • rotation matrices implemented to run on the GPU;

  • periodic updates of the model with the best weights according to different terms of the loss function, to avoid getting stuck in a local minimum;

  • each saved checkpoint keeps the model's training history, so its statistics can be reused for new training runs with changed parameters.

Extended Splitter Model Training Trainer

The constructed trainer makes it comfortable to train and retrain the model on larger volumes of data on the T4 card, and the videos can be of different lengths. 500 examples have already been selected for training.
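For orientation, here is a minimal sketch of how one epoch of such a trainer might be organized. The step functions (plain, scissors, autoregressive) are passed in as callables and stand for the steps listed above; this is an outline, not the author's actual trainer.

```python
import random
import torch

def train_epoch(splitter, dataset, optimizer, step_fns, videos_per_epoch=32):
    """One epoch: a small random sample of videos, each trained in both directions
    with every configured training step, using the vector-valued combined loss."""
    subset = random.sample(range(len(dataset)), min(videos_per_epoch, len(dataset)))
    for i in subset:
        item = dataset[i]
        for direction in (0, 1):                   # forward and reversed
            loss = sum(fn(splitter, item, direction) for fn in step_fns)
            optimizer.zero_grad()
            loss.backward(torch.ones_like(loss))   # loss is a latent-dimension tensor
            optimizer.step()
```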

Comparison of trained Splitter weights

Comparative generations with one seed

Comparative generations from the same noise, based on vectors obtained from models with different types of training and their combinations.

It is quite noticeable that the additional training steps, both with simple rotation matrices and with the regression step, introduce additional information into the model and make the generations more interesting.

Interesting results

Interesting examples of generation from vectors predicted by step number from the starting vector. They demonstrate fairly coherent frames of the generated video.

generation step by step


autoregressive generation

by step number

autoregressive

The research from this stage is also available in my repositories.

This was the pre-defense stage, and based on its results I have already moved on to making the Splitter model itself more complex in order to understand how to improve its generative capabilities. To be continued.
