Music2Dance: how we tried to learn to dance

Hello everyone! My name is Vladislav Mosin, and I am a 4th-year undergraduate student in Applied Mathematics and Computer Science at the St. Petersburg campus of HSE. Last summer, together with Alina Pleshkova, a graduate student at our faculty, I did an internship at JetBrains Research. We worked on the Music2Dance project, whose goal is to learn to generate dance moves that fit a given piece of music. This could be used, for example, for self-study of dancing: you hear some music, launch the application, and it shows you movements that combine harmoniously with that music.

Looking ahead, I will say that our results, unfortunately, turned out to be far from the best motion generation models that exist now. But if you are also interested in digging into this problem, read on.

From the movie “Pulp Fiction”

Existing approaches

The idea of generating dance from music is pretty old. Probably the most striking example is dance simulators such as Dance Dance Revolution, where the player must step on floor panels that light up in time with the music, and thus a kind of dance is created. Another nice result in this area is the creation of dancing geometric shapes or 2D stick figures.

There is also more serious work: generating 3D movements for humans. Most of these approaches are based solely on deep learning. As of summer 2020, the best results were shown by the DanceNet architecture, and we decided to take it as a baseline. We will discuss this approach in more detail below.

Data preprocessing

In the task of generating dance from music, there are two types of data: music and video, and both need to be preprocessed, since the models cannot work with raw data. Let’s talk about the processing of each type in a little more detail.

Music: onset, beats, chroma

Probably the most common way to extract features from audio is to compute a spectrogram or mel-spectrogram, converting the sound from the amplitude domain to the frequency domain using the Fourier transform. However, in our task we work with music rather than an arbitrary audio signal, and such low-level analysis is not suitable here. We are interested in rhythm and melody, so we extract onsets, beats, and chroma (note starts, rhythm, and the melodic mood).
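For illustration, extracting all three feature types takes only a few lines with librosa (a standard choice for this kind of preprocessing, though not necessarily the library we used):

```python
import librosa

# Load the track and extract the three feature groups described above:
# onsets (note starts), beats (rhythm), and chroma (melodic content).
y, sr = librosa.load("track.wav")
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
onsets = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr)
tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
```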

Video: Extracting a Human Pose

Everything is much more interesting here. A naive approach, trying to predict the next frame directly from a video of people dancing, is doomed to failure. For example, the same dance filmed from different distances would be interpreted in different ways. In addition, the dimension of a frame, even at a small resolution (for example, 240×240), exceeds several tens of thousands of pixels.

To work around these problems, human pose extraction from video is used. The pose is specified by a set of physiologically key points of the body, from which the whole pose can be reconstructed. These points include, for example, the head and pelvis, elbows and knees, feet and hands.

Key points are marked in blue

This method reduces the dimension of the input data (instead of a frame with several tens of thousands of parameters, the input is now a low-dimensional vector) and also lets us concentrate on the movements themselves.

An important detail of this method is how the positions of the key points are stored. If you simply store absolute positions in 3D space, the problem of non-fixed bone length arises: the distance between the shin and the knee, or between the shoulder and the elbow, can change from frame to frame, which is not the expected behavior. To avoid this, the position of one point of the body is fixed, namely the middle of the pelvis, and the position of every other point is specified through a bone length and a rotation angle relative to the previous point. For example: the position of the hand is specified through the length of the forearm and its rotation relative to the elbow.
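To make this concrete, here is a toy forward-kinematics sketch, simplified to a single 2D chain (the real parameterization works in 3D, but the idea is the same):

```python
import numpy as np

# Toy 2D forward kinematics: each joint is stored as (bone_length,
# angle relative to the parent bone), and absolute positions are
# recovered by walking the chain from the fixed pelvis. Bone lengths
# are constant by construction; only the angles change between frames.
def forward_kinematics(root, chain):
    positions = [np.asarray(root, dtype=float)]
    direction = 0.0  # accumulated absolute angle, in radians
    for length, rel_angle in chain:
        direction += rel_angle
        step = length * np.array([np.cos(direction), np.sin(direction)])
        positions.append(positions[-1] + step)
    return positions

# pelvis -> shoulder -> elbow -> hand
pose = forward_kinematics(root=(0.0, 0.0),
                          chain=[(0.5, np.pi / 2), (0.3, -0.4), (0.25, -0.3)])
```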

DanceNet architecture

DanceNet architecture. Source: https://arxiv.org/abs/2002.03761

The DanceNet architecture consists of several main parts:

  • Music encoding;

  • Classification of music by style;

  • Video frame encoding;

  • Predicting the next frame from the previous ones and the music;

  • Decoding the received frame.

Let’s take a closer look at each of the parts:

  1. Music encoding. The pre-converted audio signal is encoded using a convolutional neural network with a Bi-LSTM layer.

  2. Classification of music by style. Similar to the previous point, a convolutional neural network with a Bi-LSTM layer.

  3. Frame encoding-decoding. Small two-layer convolutional network.

  4. Predicting the next frame. The most substantial part of the architecture, which actually predicts the next pose from the previous ones and the music. It consists of blocks of dilated convolutions with skip connections (see the sketch below).
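The paper has more detail, but the core unit is WaveNet-style: a dilated 1D convolution with a gated activation, plus residual and skip connections. A rough PyTorch sketch (our simplification, not the authors' code):

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Dilated 1D convolution with gating, residual and skip paths."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2,
                                     dilation=dilation, padding=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2,
                                   dilation=dilation, padding=dilation)
        self.residual = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        t = x.size(-1)  # crop the padded output back to the input length
        gated = torch.tanh(self.filter_conv(x)[..., :t]) * \
                torch.sigmoid(self.gate_conv(x)[..., :t])
        return x + self.residual(gated), self.skip(gated)

block = DilatedBlock(channels=64, dilation=2)
out, skip = block(torch.randn(1, 64, 120))  # (batch, channels, time)
```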

DanceNet accepts music and a set of poses as input, but it outputs not simply a pose but the parameters of a normal distribution, its mean and variance, from which the answer is sampled; the negative log-likelihood of the correct pose is used as the loss function.
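In code, this loss is a Gaussian negative log-likelihood. A minimal sketch (the tensor shapes and the log-variance parameterization are our assumptions):

```python
import torch

# NLL of the ground-truth pose under the predicted Gaussian, constant
# terms dropped. Predicting log-variance keeps sigma positive; at
# generation time the pose is sampled as mu + log_var.exp().sqrt() * noise.
def gaussian_nll(mu, log_var, target):
    return 0.5 * (log_var + (target - mu) ** 2 / log_var.exp()).mean()

mu = torch.zeros(8, 54)       # batch of 8 poses, 54 pose parameters
log_var = torch.zeros(8, 54)
target = torch.randn(8, 54)
loss = gaussian_nll(mu, log_var, target)
```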

Our solution

Existing deep learning solutions have one major problem: for the result to look realistic, you need to implement various kinds of constraints manually. For example, elbows and knees cannot bend in the opposite direction; with a static body position, the center of gravity must be between the feet; and there are many other restrictions. To handle this automatically, we suggest using reinforcement learning. The key ingredient of this approach is an environment that prevents the agent from taking physically incorrect poses.

Solution architecture: training phase

Our solution has four main parts:

  • Dataset

  • A model that predicts the next position (DanceNet)

  • A model that corrects the predicted position (RL model)

  • Loss function

Dataset

One of the main issues in solving machine learning problems is the choice of a dataset. As of summer 2020 there were no good-quality open datasets for the dance generation task, and the authors of existing solutions collected them on their own. The authors of the paper we took as a baseline approached the data question seriously and filmed several hours of professional dancing. Unfortunately, they declined our request to share the dataset, even purely for the benefit of science. Since living without a dataset is completely sad, we had to invent something. In the end, we decided to build a dataset from what we found at hand: videos from YouTube and the VIBE library for human pose estimation.

DanceNet

In our solution, we used the original model from the paper with one small change: we removed the music style classification, since, firstly, our goal was to generate a dance for arbitrary music, and secondly, the collected data contained no style labels and was musically very diverse.

RL model

The task of the RL model is to correct the body position so that it looks more realistic. As input, the model takes the movement (for each point, the difference between the new and old positions) and the old body position; as output, it returns the corrected new position.

Let’s consider the model in detail. Reinforcement learning has two main parts: the learning algorithm and the environment.

Structure of Reinforcement Learning Algorithms

Among reinforcement learning algorithms, we decided to pick one algorithm based on Q-learning (our choice fell on TD3 as the most stable and expressive) and one that does not use it (we settled on PPO).

The environment turned out to be a less simple matter than the algorithms. Properly speaking, there is no perfect ready-made environment for this task, and you would need to implement one yourself. However, writing an environment that is fully correct from the point of view of physics is a rather difficult and time-consuming task that requires additional expertise. So we decided to use the ready-made Humanoid environment, which is designed to teach an agent to move as fast as possible without falling.
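With a standard environment, wiring up both algorithms is short. Here is a sketch with stable-baselines3 (the framework and the environment version are our assumptions; Humanoid also requires the MuJoCo physics backend):

```python
import gym
from stable_baselines3 import PPO, TD3

# Humanoid rewards fast forward movement and penalizes falling, so its
# reward doubles as a crude realism signal for generated poses.
env = gym.make("Humanoid-v3")

model = TD3("MlpPolicy", env, verbose=1)    # off-policy, Q-function based
# model = PPO("MlpPolicy", env, verbose=1)  # the on-policy alternative
model.learn(total_timesteps=100_000)
```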

Loss function

The main task of the loss function is to take into account both the reward of the environment, which penalizes unrealistic positions, and the correspondence to the position from the dataset:

$L(S, S_{real}, R) = \|S - S_{real}\|_2 - R,$

where S is the predicted position, S_real is the correct position from the dataset, and R is the environment reward.
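As code, the loss is essentially one line; a PyTorch sketch (the shapes are illustrative):

```python
import torch

# Distance to the dataset pose minus the environment reward: minimizing
# it pulls the pose toward the data while rewarding physical plausibility.
def combined_loss(S, S_real, R):
    return torch.linalg.vector_norm(S - S_real) - R

loss = combined_loss(torch.randn(54), torch.randn(54), R=1.3)
```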

The model in the testing phase

Solution architecture: testing phase

During the training phase, the model predicted the next body position from the previous ones and the music; at test time, however, only music comes to the model's input, and there are no previous positions. To keep this from becoming a problem, we added a few seconds of a stable stationary pose to the beginning of every dance. Now, at test time, we can initialize the previous positions with this stable standing pose and predict the next ones based on the music alone (a sketch of this loop is below).
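Schematically, the test-time loop looks like this (the model and feature shapes are stand-ins for illustration, not our actual code):

```python
import numpy as np

SEED_FRAMES, POSE_DIM = 60, 54            # values chosen for illustration
neutral_pose = np.zeros(POSE_DIM)         # the stable standing pose

def predict_next(history, music_frame):   # stand-in for the trained model
    return history[-1] + 0.01 * music_frame.mean()

music = np.random.rand(200, 16)           # per-frame audio features
history = [neutral_pose] * SEED_FRAMES    # seed with the stationary pose
dance = []
for frame in music:
    pose = predict_next(history, frame)
    dance.append(pose)
    history = history[1:] + [pose]        # slide the pose window forward
```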

Results

Unfortunately, our results are not as impressive as those of the DanceNet authors. Our agent learned to generate some dance movements, for example, rotating around its axis and moving its hands smoothly, but they do not add up to a full-fledged dance. We attribute this primarily to the difference in dataset quality: we were unable to collect really high-quality data, since automatic pose extraction from YouTube videos is significantly worse than manually filmed footage of professional dancers, and the RL algorithm was unable to compensate for this.

Our sad dance. Thank you for your interest!


More material from our internship blog:

  • How to combine 10 BERTs for general text comprehension tasks?

  • An unmanned taxi drives yellow rubber ducks around the city! Problem checking module for the Gym-Duckietown platform

  • Google internships: Zurich, London and Silicon Valley

