How to Bring Kandinsky to Life with Rotation Matrices for Video Generation – The Splitter Next Model (Part 3)

The first part showed a method for generating video by modifying a text embedding with changes taken from the frame embeddings of another video via rotation matrices. The second part presented the initial approaches and implementations for generating video from text by training a simple Splitter model. The task of the Splitter model is to produce a series of close text embeddings, which the Decoder then uses to generate close images. After the first results were obtained, the task of improving and deepening the Splitter model arose.

Search for further improvement

A review of papers on rotation matrices in machine learning reveals their limited use: rotation matrices appear mostly in problems involving 3D and 2D spaces. However, one paper, “On Learning Rotations” from 2009, analyzes rotation groups in depth and suggests using the von Neumann divergence as a measure of distance between rotation matrices.

This study inspired me to consider integrating rotation matrices more deeply into the loss function, in order to compare rotation-matrix-based changes of embeddings in different spaces. The base loss function combines MSE and CosineEmbedding losses between embedding differences, which fits the basic scenario well: changes in both latent spaces are compared both metrically and angularly.
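
For concreteness, here is a minimal sketch of such a base loss (all names are illustrative, not the project's actual code): the shift between the embeddings predicted by the model is compared with the reference shift both by MSE and by cosine.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
cos = nn.CosineEmbeddingLoss()

def base_transition_loss(pred_next, pred_start, target_next, target_start, w_cos=1.0):
    """Compare the *change* produced by the model with the reference change."""
    pred_delta = pred_next - pred_start        # shift predicted by the Splitter
    target_delta = target_next - target_start  # reference shift between frames
    # CosineEmbeddingLoss needs a target flag of +1 ("make these similar")
    target_flag = torch.ones(pred_delta.size(0), device=pred_delta.device)
    return mse(pred_delta, target_delta) + w_cos * cos(pred_delta, target_delta, target_flag)
```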

I will not delve into loss function development in this article: there is still a lot of work ahead and the results are ambiguous. In general, everything leads to geometric methods and the topology of multidimensional spaces. The following GIFs show well how ambiguous the choice of a trajectory in latent space is when creating close vectors from which close pictures for a film will later be generated:

An example of training with the transformation-based loss function

Spoiler: there are more interesting examples based on the transformation loss function, which motivates further research into loss functions defined through transformations in space.

Splitter_Dual with transform loss

Additional interest was sparked by the article Rotated Word Vector Representations and their Interpretability, which explores angular representations and rotation matrices in the context of NLP.

The article Enhanced Transformer with Rotary Position Embedding introduced me to a new approach to token positioning through rotation of word vectors. It allows a model to take into account both the absolute positions of tokens and their relative distances to each other, which can be extremely useful for understanding text structure and improving model accuracy. RoPE, introduced in the RoFormer paper, encodes positional information with rotation matrices applied to the query and key vectors in the Transformer attention mechanism, which keeps the vectors stable and preserves relative positions without losing context. Implemented through simple vector rotation operations, it also keeps data processing efficient.

RoPE

Instead of adding a position vector, RoPE applies a rotation to the word vector, with the rotation angle proportional to the word's position in the sentence. This keeps the vectors stable and preserves relative positions. Technically, the rotation is implemented through simple vector operations, which keeps it efficient.
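
A minimal sketch of this rotation, following the usual RoPE formulation (channel pairs rotated by position-dependent angles); the actual layer used in the Splitter may differ in details:

```python
import torch

def rotary_position_embedding(x, base=10000.0):
    """Apply RoPE to x of shape (batch, seq_len, dim); dim must be even.
    Pairs of channels are rotated by an angle proportional to the token position."""
    _, seq_len, dim = x.shape
    half = dim // 2
    freq = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    pos = torch.arange(seq_len, device=x.device).float()
    angles = torch.outer(pos, freq)              # (seq_len, half)
    sin, cos = angles.sin(), angles.cos()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1, x2) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```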

The slides above show how embeddings are enriched by applying rotations to them. This is what rotary embeddings do in transformer models of the RoFormer family, and it suits the discrete nature of text well. It also characterizes what multidimensional rotation matrices do when training the Splitter model in the additional training steps: they enrich the new starting vectors and text embeddings.

I decided it would be worthwhile to add a RoPE layer to the Splitter model to preprocess the incoming last-hidden-state text embeddings. I also switched the model to a block structure instead of a simple alternation of layers. All of this can improve the performance of the Splitter model.

Givens Coordinate Descent Methods for Rotation Matrix Learning in Trainable Embedding Indexes

This article also confirmed the direction of my search. It is devoted to the application of rotation matrices in recommender systems for document retrieval in search databases, and to embedding these matrices into the training process.

My rotation matrices are applied in the same way, but I decided to refine the training a little. In text models, in addition to the last-hidden-state output that I need, there is often a projection layer for text embeddings. The CLIP-ViT‑G model that I am using at the current stage of work also has one. I decided to add data from this layer to the preprocessing stage of data collection, since it is convenient for computing the cosine proximity between text embeddings and unclip vectors, both input and predicted.
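
As an illustration of what gets collected during preprocessing, here is a hedged sketch using the Hugging Face transformers API; the model id and subfolders follow the diffusers Kandinsky 2.2 prior pipeline and may need adjusting for the project's actual setup.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", subfolder="tokenizer"
)
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", subfolder="text_encoder"
)

tokens = tokenizer(["a dog runs on the beach"], padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens)

last_hidden_state = out.last_hidden_state  # per-token embeddings fed to the Splitter
text_proj = out.text_embeds                # pooled projection, comparable with unclip vectors

# stand-in for an unclip (image) vector from the Prior / CLIP image encoder
unclip_vec = torch.randn_like(text_proj)
cos_sim = torch.nn.functional.cosine_similarity(text_proj, unclip_vec, dim=-1)
```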

Changes in the model structure

To make the neural network expressive enough to handle a large number of video descriptions, I also decided to slightly change both the input data and the internal structure of the network.

Previously, to specify which frame we need relative to the current state, the network was fed a natural number characterizing the distance between frames, plus a direction class, 0 or 1 (forward or backward). The length of the video was not fed into the network, although it was, of course, indirectly present in training through the work of the loss function.

Information about the direction and strength of the step

I decided to combine the direction and strength of the step into a single value in the range (-1, 1) by normalizing the frame offset by the total number of frames. This change not only simplifies the input data, but also potentially improves prediction accuracy, because the network can now take into account not only the direction of movement in time but also its intensity relative to the total duration of the video.

A linear layer processes this new input, converting the continuous value into the format the model needs. This adds flexibility: a linear layer can be tuned to capture subtle nuances of frame changes that would be harder to achieve with standard embedding layers for discrete classes.
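
A possible sketch of this input encoding (the hidden dimension and module names are illustrative, not the project's exact code):

```python
import torch
import torch.nn as nn

class StepEncoder(nn.Module):
    """Encode direction and strength of the step as one value in (-1, 1).
    The sign gives the direction in time, the magnitude gives the jump relative
    to the total number of frames; a linear layer lifts it to the model dimension."""
    def __init__(self, hidden_dim=1280):
        super().__init__()
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, frame_shift, total_frames):
        # frame_shift: signed number of frames to move, e.g. tensor([-3.]) or tensor([5.])
        step = frame_shift.float() / total_frames.float()   # value in (-1, 1)
        return self.proj(step.unsqueeze(-1))

# usage: StepEncoder()(torch.tensor([5.0]), torch.tensor([32.0]))
```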

Adding Rotary Positional Embeddings

To improve the understanding of text structure and word positions, a layer of rotary positional embeddings (RoPE) is added. It combines the advantages of absolute and relative positional vectors by representing positional information through rotation of the word vectors, and should allow the model to understand both the absolute positions of tokens and their relative distances.

Adding a layer of cross-attention

In addition to the RoPE layer, a cross_attention layer is added to the Splitter model. Information from the text embeddings is first encoded with rotary position embeddings and is then combined with the information about the direction and strength of the step in the cross_attention layer; a sketch is given after the reference below.

Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
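
A minimal sketch of such a cross-attention block, where the step embedding queries the RoPE-encoded text tokens (the dimensions and head count are assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn

class StepCrossAttention(nn.Module):
    """The step embedding attends over the RoPE-encoded text embeddings."""
    def __init__(self, dim=1280, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, step_emb, text_tokens):
        # step_emb: (batch, dim) -> one query token; text_tokens: (batch, seq, dim)
        q = step_emb.unsqueeze(1)
        mixed, _ = self.attn(query=q, key=text_tokens, value=text_tokens)
        return self.norm(step_emb + mixed.squeeze(1))   # residual connection
```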

Adding Projection Text Embeddings

As I wrote above, the projection of the text embeddings was needed to calculate the cosine proximity between text and unclip vectors. I decided to replace the fixed external coefficients used at first with dynamic ones. Before the last-hidden-state embeddings are rotated, the cosine proximity between their projection and the unclip vectors is computed, and this proximity value is used in the modified rotation-matrix class, so that the matrices rotate the last-hidden-state vectors more accurately. The same softer rotation is applied during the regression step in training and during the regressive generation of unclip vectors for the subsequent generation of images by the Decoder.
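
One way such a dynamic coefficient could look; this is only a sketch of the idea, and the exact scaling used in the project may differ:

```python
import torch
import torch.nn.functional as F

def scaled_rotation_angle(text_proj, unclip_vec, base_angle):
    """Scale the rotation applied to the last-hidden-state vectors by the cosine
    proximity between the projected text embedding and the unclip vector."""
    cos_sim = F.cosine_similarity(text_proj, unclip_vec, dim=-1)  # in [-1, 1]
    weight = cos_sim.clamp(min=0.0)                               # ignore opposed vectors
    return base_angle * weight.unsqueeze(-1)
```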

Adding an accuracy metric

Initially, I tracked only the model's losses, using them both to monitor training and to decide on model updates (see the trainer diagram in Part 2). Gradually, it became clear how to calculate the model's accuracy: the primary unclip vectors from the Prior and the image vectors from the CLIP-ViT‑G model have a certain cosine proximity, since the Prior model was trained to bring them closer together.

As a result, this accuracy became a perfectly adequate metric to complement the loss values. A working version of the accuracy calculation also suggested how to dynamically parameterize the contribution of the regression step to the total loss during training: the weight is simply the accuracy averaged over a window of previous epochs. Thanks to this, the model also began to learn more smoothly, and the loss from the regression step now contributes more to the final per-epoch loss as the model's accuracy grows.
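
A sketch of this windowed weighting (the window size and function names are illustrative):

```python
from collections import deque
import torch.nn.functional as F

acc_history = deque(maxlen=10)   # sliding window over recent epochs

def epoch_accuracy(pred_unclip, target_unclip):
    """Accuracy as mean cosine proximity of predicted and target unclip vectors."""
    return F.cosine_similarity(pred_unclip, target_unclip, dim=-1).mean().item()

def total_loss(step_loss, regression_loss):
    """The regression step contributes more as the windowed accuracy grows."""
    w = sum(acc_history) / len(acc_history) if acc_history else 0.0
    return step_loss + w * regression_loss

# at the end of each epoch:
# acc_history.append(epoch_accuracy(pred_unclip, target_unclip))
```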

Transition to a block structure of the model

I replaced the plain linear layers, supplemented with activation and regularization functions, with Improved Blocks with gradient passthrough. Used instead of a simple cascade of linear layers, these blocks bring a number of key features and advantages that can significantly improve training quality and the efficiency of the architecture.
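
A plausible sketch of such a block, with a skip connection providing the gradient passthrough (the exact layer composition in Splitter Next may differ):

```python
import torch
import torch.nn as nn

class ImprovedBlock(nn.Module):
    """Residual block: linear + activation + normalization + dropout, with a skip
    path so gradients can flow past the nonlinearity."""
    def __init__(self, in_dim, out_dim, p_drop=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.LayerNorm(out_dim),
            nn.Dropout(p_drop),
        )
        # match dimensions on the skip path when the block changes width
        self.skip = nn.Identity() if in_dim == out_dim else nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.body(x) + self.skip(x)
```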

The structure of the updated Splitter Next model, built from such blocks, with cross-attention layers and a trainable Dropout layer in the step-down blocks.


Splitter Next

Results of applying the updated Splitter Next model, trained with the updated trainer.

It is worth noting that the first generated frame changes only slightly from what the pure Text-2-Image model (left) gives. This can be useful for stitching together videos with similar texts, using the first step of a new generation with a new text.

Below are examples of different types of generation: by the step strength alone and by regression steps, including with rotation of the input embeddings based on a comparison of the input and generated vectors. There is also an example from the Kandinsky video model built on Kandinsky 3.1.

Splitter_Dual

Next, I experimented with mixing information from the full text embeddings and the unclip vectors through different versions of the cross-attention layer. It turned out that this combination of information is very unstable in training.

The most effective model turned out to be a dual-branch Splitter structure.


Splitter_Dual

Incream_dog_good_2.gif
gen_from_gen_increament_dual_acc_all_ways_base_losses_norm_cos_1_new_data_300_seed_70804.gif

gen_from_gen Splitter_dual

Both examples above were obtained from the Splitter_Dual model with the autoregressive variant of creating embeddings for subsequent frame generation by the decoder.

Getting embeddings by autoregression

As the diagram shows, after an embedding is predicted, and before it is used autoregressively for the next prediction by the Splitter_Dual model, the text vectors are rotated by the rotation matrix derived from the change between the predicted and the starting embedding, just as during training.
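
In pseudocode-like form, the loop could look as follows; splitter and rotation_from_shift are placeholders standing in for the project's own modules, not its actual API:

```python
import torch

def autoregressive_generation(splitter, rotation_from_shift,
                              text_emb, start_unclip, steps, step_value):
    """Sketch of the autoregressive loop: predict an unclip vector, build a rotation
    matrix from the change between predicted and starting embedding, rotate the
    text vectors with it, and repeat."""
    unclip_vecs = [start_unclip]
    current_text, current_unclip = text_emb, start_unclip
    for _ in range(steps):
        pred_unclip = splitter(current_text, current_unclip, step_value)
        R = rotation_from_shift(current_unclip, pred_unclip)  # rotation matrix
        current_text = current_text @ R.T                     # rotate text vectors
        current_unclip = pred_unclip
        unclip_vecs.append(pred_unclip)
    return unclip_vecs
```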

As a result, the model becomes quite creative and interesting, although the embeddings it produces for decoder generation drift far from the original Text-2-Image picture. If you let the model generate embeddings further, its hallucinations become clearly visible, yet there remains a logical, and sometimes funny, coherence in what is happening.

gen2gen_DualBranchSpliter_textfrizee_seed_70804.gif

gen2gen_SpliterDual_textfrizee.gif

gen_from_gen_increament_dual_rote_all_ways_base_losses_norm_cos_1_new_data_300_with_dim2norm_1_seed_70804.gif

gen2gen_splitter_dual_rote_all_ways_base_losses_new_data_300.gif

The main experiments, or rather the training of the Splitter models, were conducted on 300-500 videos from the TGIF dataset using a T4 GPU in Colab; those were the resources available to me.

gen2gen_DualBranchSpliterCSA_trainloop_W_allways_baseloss_pwr0_tgif300500_rize_1e-75_rote_pow_0_seed_8599.gif

gen2gen_DualBranchSpliterCSA.gif

But even with such a small and noisy dataset, short text descriptions for the videos and a text decoder in the Kandinsky 2.2 Prior that is weak by modern standards, the effectiveness of the described approach is clearly visible: it is enough to train a small Splitter model, which acts as a trained rotation matrix predicting the direction of change of the generation vector for the ready-made, frozen Kandinsky 2.2 model.

Model evaluation

As part of my master's thesis at MIPT, I needed to evaluate the models I had obtained for video generation. For the evaluation I took the MSR-VTT (Microsoft Research Video to Text) dataset. It is often used as a benchmark for both video-to-text and text-to-video tasks. Trial runs showed that it contains some examples that are very strange, or at least very difficult, for the text-to-video task.

Problems with Microsoft Research Video to Text Dataset

On the left is an example from the dataset, where a voice-over announcer describes the game; on the right is a video generated by the model from the dataset's description. About 40% of the videos in a random sample are ambiguous in this way.

To evaluate the model's performance, I selected 30 videos from a random sample; that was what my resources allowed, since evaluation is a multi-stage and expensive process. I then cropped the central square from the frames and assigned each video a text description created by the multimodal model UForm. Next, I generated 16 frames from the new text descriptions with the Splitter + Kandinsky 2.2 model and with the Text2VideoZero model, and evaluated them with the corresponding metrics: the FID metric from the PyTorch library, and the remaining metrics with the common_metrics_on_video_quality framework.
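
For reference, here is a minimal FID computation, assuming the torchmetrics implementation and uint8 frame tensors; the data below is random stand-in data, not the actual evaluation frames.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# stand-ins for real (dataset) and generated frames, shape (N, 3, H, W), dtype uint8
real_frames = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake_frames = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print(fid.compute())
```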

Comparison

The result shows that the non-specialized video model Kandinsky 2.2, supplemented with the Splitter_Dual insert, performs within the margin of error of the specialized Text2VideoZero model, even though Splitter_Dual was trained on only 500 videos from a different dataset.

Example from the test generation

In this and the previous two parts, I have shown the path and the results of my master's thesis at MIPT, from October 2023 to May 2024.

In parallel, entire teams began publishing work on video generation. Most of them, where it can be seen, use an approach in which subsequent frames are generated from already generated frames. That is, the models leave the latent space and effectively work as image-2-video models.

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework, 22 Mar 2024
Kandinsky Video 1.1 used interpolations

In my work, all steps are performed in the latent space of texts: a set of carefully modified embeddings is created, from which images are then generated separately; these images turn out to be close in meaning and motion and are suitable for combining into a video.

So it was nice to realize that I managed to solve this problem on my own, at my technical and resource level, in step with entire teams, using only Colab and a small dataset. The research from this stage is also presented in my repositories, and there are video presentations from the DataStart conference.

As I wrote above, I continue to work on this problem and to look for solutions at the level of latent-space topology for other, more modern image generation models with powerful text decoders. To be continued.
