What is the best audio generation method? Comparison of GAN, VAE and Diffusion

In the previous article, I touched on sound generation with diffusion models. But which methods exist in general, and which of them is the most promising right now? Today we will trace the long road this area of machine learning has traveled. Let's listen to the results, look at the metrics, and take a look at the new techniques used in very different neural networks for audio synthesis.

▍ GAN: pjloop-gan

Generative adversarial networks (GANs) are deep learning models that consist of two parts: a generator and a discriminator. The generator creates synthetic (fake) data from a random input vector. The discriminator is a classifier that tries to decide whether its input is real or synthetic. During training, both are updated iteratively so that the generator eventually produces realistic synthetic data that fools the discriminator. GANs have produced state-of-the-art (SOTA) results in image generation and other areas. They have also been used to create music in both the symbolic and the audio domain.
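The two opposing objectives can be written down in a few lines. This is a minimal sketch using the standard binary cross-entropy formulation, not the loss of any specific model in this article; the function names and the example logits are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_loss(real_logits, fake_logits):
    """Binary cross-entropy: real samples should score 1, fakes should score 0."""
    real_term = sum(-math.log(sigmoid(r)) for r in real_logits)
    fake_term = sum(-math.log(1.0 - sigmoid(f)) for f in fake_logits)
    return (real_term + fake_term) / (len(real_logits) + len(fake_logits))


def generator_loss(fake_logits):
    """Non-saturating loss: the generator wants its fakes to be scored as real."""
    return sum(-math.log(sigmoid(f)) for f in fake_logits) / len(fake_logits)


# Hypothetical discriminator outputs (logits) for two real and two fake samples.
confident_d = discriminator_loss([4.0, 3.0], [-4.0, -3.0])  # fakes detected
fooled_d = discriminator_loss([4.0, 3.0], [4.0, 3.0])       # fakes look real
```

When the fakes are detected, the discriminator loss is low and the generator loss is high; when the discriminator is fooled, it is the other way around. That tug of war is what makes training unstable.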

GAN-based neural networks have been used for sound generation ever since the architecture appeared. However, GANs are notoriously difficult to train because of the opposing goals of their two parts: the key idea of the architecture is also its key flaw. To improve the quality and speed of GAN training, pjloop-gan uses Projected GAN, which adds a pretrained feature network with random components to the discriminator; these components help prevent collapse.

Pjloop-gan uses a set of four feature projectors so that data distributions are matched in feature space rather than directly in data space. Each projector consists of a pretrained feature network plus cross-channel mixing (CCM) and cross-scale mixing (CSM). CCM mixes features across channels, while CSM further mixes them across scales. The random projections introduced by both help the discriminators take the entire feature space into account. The Projected GAN loss is computed by summing the outputs of the discriminators used.
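A rough sketch of the channel-mixing idea, under heavy simplification: each scale gets a fixed random matrix that mixes channels (a random 1x1 convolution), a per-scale discriminator scores the mixed features, and the scores are summed. The toy discriminator, the feature shapes and all names are hypothetical, and mixing across scales is left out.

```python
import random


def random_channel_mixer(num_channels, seed):
    """Fixed random weights (never trained) that mix channels with each other."""
    rng = random.Random(seed)
    scale = 1.0 / num_channels ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(num_channels)]
            for _ in range(num_channels)]


def mix_channels(features, weights):
    """Apply a 1x1 convolution: features is a list of channels over time steps."""
    steps = len(features[0])
    return [[sum(w * features[c][t] for c, w in enumerate(row)) for t in range(steps)]
            for row in weights]


def toy_discriminator(features):
    """Stand-in for a trainable per-scale discriminator: mean activation as logit."""
    flat = [value for channel in features for value in channel]
    return sum(flat) / len(flat)


def projected_logits(multi_scale_features, mixers):
    return [toy_discriminator(mix_channels(f, w))
            for f, w in zip(multi_scale_features, mixers)]


rng = random.Random(0)
scales = [(8, 16), (16, 8), (32, 4), (64, 2)]  # hypothetical (channels, time steps)
features = [[[rng.random() for _ in range(steps)] for _ in range(channels)]
            for channels, steps in scales]
mixers = [random_channel_mixer(channels, seed=i) for i, (channels, _) in enumerate(scales)]
logits = projected_logits(features, mixers)
total_logit = sum(logits)  # the outputs of all discriminators are summed
```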

To evaluate the model, the authors ran a listening study in which participants rated audio tracks generated by different models on a five-point scale, using the following criteria:

  • Drummability or harmony
    Whether the sample contains percussive and harmonic sounds.
  • Loopability
    Whether the sample can be played on repeat seamlessly.
  • Audio quality
    Whether the sample is free from unpleasant noise or artifacts.

The results showed that the model built on StyleGAN2 (VGG + SCNNLoop) performed best, outperforming both plain StyleGAN2 and the real audio. The study also showed that generating music is the more difficult task.


Despite the good results, GANs still struggle to generate “plausible” data and are hard to control. Still, it's nice that, amid the hype around transformers and diffusion models, someone is still working on practically forgotten approaches.


▍ VAE: RAVE

Variational autoencoders (VAEs) are a type of deep generative model used primarily to generate synthetic data. They let you control the generation process by exposing a latent variable, but usually suffer from poor synthesis quality.

VAEs are used quite often, and successfully, in combination with GANs or Transformers.

The Real-time Audio Variational autoEncoder (RAVE) is an improved version that synthesizes audio signals quickly and with high quality. It is trained with a two-stage procedure: representation learning followed by adversarial fine-tuning. It uses a multi-band decomposition of the raw waveform to generate 48 kHz audio while running 20 times faster than real time on a standard laptop CPU. To find the most informative parts of the latent space, RAVE applies Singular Value Decomposition (SVD) together with a fidelity parameter that sets the rank of this space based on its singular values.
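One way to read "the rank is determined from the singular values" is: keep the smallest number of dimensions whose singular values account for a chosen share of the total. The sketch below assumes exactly that rule; the function name and the singular values are made up for illustration and are not taken from the RAVE code.

```python
def rank_for_share(singular_values, share):
    """Smallest rank whose singular values cover at least `share` of their sum."""
    values = sorted(singular_values, reverse=True)
    total = sum(values)
    running = 0.0
    for rank, value in enumerate(values, start=1):
        running += value
        if running / total >= share:
            return rank
    return len(values)


# Hypothetical singular values of a 5-dimensional latent space.
example_values = [8.0, 4.0, 2.0, 1.0, 1.0]
compact_rank = rank_for_share(example_values, 0.75)  # first two dimensions suffice
full_rank = rank_for_share(example_values, 1.0)      # every dimension is needed
```

A low share keeps only a handful of dimensions that are easy to control by hand, while a share close to 1 keeps nearly the whole latent space.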

To make the model efficient, a multi-band decomposition of the raw waveform is used to reduce the temporal dimension of the data. This gives a larger temporal receptive field for the same computational cost. The encoder combines the multi-band decomposition with a CNN to transform the raw waveform into a 128-dimensional latent representation. The decoder is a modified version of the generator that converts its last hidden layer into a multi-band audio signal, an amplitude envelope and a noise generator. A discriminator is used to prevent artifacts, and a feature matching loss is added for better training. Training the model takes about 6 days on a TITAN GPU, and for evaluation it is compared with two unsupervised models.
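The effect on shapes is easy to see with a toy example. The sketch below is not a real multi-band decomposition (that needs a filter bank); it only interleaves samples into channels to show how the time axis shrinks as the channel count grows. The number of bands (16) is an assumption for illustration.

```python
def split_into_channels(waveform, num_channels):
    """Naive polyphase split: sample i goes to channel i % num_channels."""
    if len(waveform) % num_channels:
        raise ValueError("waveform length must be divisible by num_channels")
    return [waveform[c::num_channels] for c in range(num_channels)]


def merge_channels(channels):
    """Inverse of split_into_channels: interleave the channels again."""
    return [sample for frame in zip(*channels) for sample in frame]


SAMPLE_RATE = 48_000
NUM_BANDS = 16  # assumed value, for illustration only

one_second = [float(i) for i in range(SAMPLE_RATE)]
channels = split_into_channels(one_second, NUM_BANDS)
steps_per_second = len(channels[0])  # time steps the network sees per second
```

With these numbers the network processes 3,000 time steps per second of audio instead of 48,000, so the same convolution stack covers a 16 times longer stretch of sound.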

The two-stage training procedure used for this variational autoencoder works as follows.

Stage #1 is representation learning. It uses a multi-scale spectral distance to measure how far the synthesized waveform is from the real one, by comparing the magnitudes of their short-time Fourier transforms (STFT).
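A dependency-free sketch of such a distance is shown below. It uses a common variant (mean absolute difference of the magnitudes plus mean absolute difference of their logarithms, summed over several window sizes), which is not necessarily the exact formula from the RAVE paper. The window sizes and test signals are arbitrary, and the naive DFT is only suitable for tiny inputs.

```python
import cmath
import math


def stft_magnitudes(signal, window_size, hop):
    """Magnitude STFT with a Hann window: a list of frames of frequency bins."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_size)
            for n in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        windowed = [signal[start + n] * hann[n] for n in range(window_size)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                    for n, x in enumerate(windowed)))
            for k in range(window_size // 2 + 1)
        ])
    return frames


def multiscale_spectral_distance(real, fake, window_sizes=(16, 32, 64), eps=1e-7):
    total = 0.0
    for window_size in window_sizes:
        real_mag = stft_magnitudes(real, window_size, window_size // 4)
        fake_mag = stft_magnitudes(fake, window_size, window_size // 4)
        pairs = [(r, f) for rf, ff in zip(real_mag, fake_mag) for r, f in zip(rf, ff)]
        linear = sum(abs(r - f) for r, f in pairs) / len(pairs)
        log = sum(abs(math.log(r + eps) - math.log(f + eps)) for r, f in pairs) / len(pairs)
        total += linear + log
    return total


tone = [math.sin(2 * math.pi * 4 * n / 64) for n in range(256)]
quieter = [0.99 * s for s in tone]  # almost the same sound
other_tone = [math.sin(2 * math.pi * 12 * n / 64) for n in range(256)]
```

Comparing in the spectral domain means a signal that sounds nearly identical gets a small distance, while a tone at a different pitch gets a large one.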

Stage #2 is adversarial fine-tuning. The decoder is trained to generate realistic synthesized signals, using the latent space from stage 1 as the base distribution and the hinge-loss version of the GAN objective. This stage also keeps minimizing the spectral distance and adds a feature matching loss.
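The adversarial terms of this stage can be sketched as follows, assuming the GAN objective is the hinge loss and the matching term is a feature matching loss over discriminator activations (both are assumptions of this sketch). Function names and numbers are illustrative only.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Push real scores above +1 and fake scores below -1."""
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake


def hinge_generator_loss(fake_scores):
    """The decoder (generator) tries to raise the scores of its outputs."""
    return -sum(fake_scores) / len(fake_scores)


def feature_matching_loss(real_features, fake_features):
    """Mean L1 distance between discriminator activations, averaged over layers."""
    per_layer = [
        sum(abs(r - f) for r, f in zip(real_layer, fake_layer)) / len(real_layer)
        for real_layer, fake_layer in zip(real_features, fake_features)
    ]
    return sum(per_layer) / len(per_layer)


# Hypothetical discriminator scores and one layer of activations.
separated = hinge_discriminator_loss([2.0, 1.5], [-2.0, -1.0])
undecided = hinge_discriminator_loss([0.0], [0.0])
matching = feature_matching_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

In this sketch the decoder's total loss would be a weighted sum of the generator loss, the feature matching loss and the spectral distance from stage 1; the weights are a tuning choice.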

The results of subjective listening experiments (MOS) show that RAVE outperforms existing models in reconstruction quality without relying on autoregressive generation or limiting the range of signals it can generate. It also achieves this with at least 3 times fewer parameters than the other tested models. RAVE2 was recently released, which gives hope that the authors intend to keep refining and improving their neural network. That makes this approach look more promising.

Listen to the results (unfortunately, there are samples only from the first version, since the second one was released quite recently).

▍ Diffusion: audio-diffusion

Diffusion models have become very popular for image generation, but they have seen little use for sound. Sound generation is complex because it involves a lot of detail: for example, how sounds change over time and how different sounds are layered on top of each other.

Of the few diffusion models available for sound generation, probably the most prominent is audio-diffusion-pytorch.

Generation works much like in diffusion networks for images: a U-Net is invoked iteratively at different noise levels to produce new, plausible samples from the data distribution. To speed up generation and create more realistic samples, the audio can first be compressed, either with an autoencoder or with an STFT. Both shorten the temporal dimension of the sound while increasing the number of channels.
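The iterative loop can be illustrated with a toy sampler. This is a plain Euler sampler over a decreasing noise schedule, not the sampler that audio-diffusion-pytorch actually uses, and the "U-Net" is replaced by a stand-in that always predicts one fixed clean signal. All names and numbers are hypothetical.

```python
import random

TARGET = [0.5, -0.25, 0.0, 0.75]  # hypothetical "clean" signal


def toy_denoiser(noisy, sigma):
    """Stand-in for the U-Net: the ideal denoiser for a dataset with one sample."""
    return list(TARGET)


def sample(denoiser, length, sigmas, seed=0):
    """Start from pure noise and step through decreasing noise levels."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(length)]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        clean_estimate = denoiser(x, sigma)  # one network call per noise level
        slope = [(xi - ci) / sigma for xi, ci in zip(x, clean_estimate)]
        x = [xi + si * (sigma_next - sigma) for xi, si in zip(x, slope)]
    return x


noise_levels = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]  # arbitrary decreasing schedule
generated = sample(toy_denoiser, len(TARGET), noise_levels)
```

Every step moves the noisy signal part of the way toward the network's current estimate of the clean signal, so the number of noise levels trades generation speed against quality.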

To improve the results, the author suggests using an additional diffusion vocoder. A diffusion vocoder is a model that reconstructs a waveform from a spectrogram, producing audio of improved quality. The waveform is first converted into a spectrogram, a time-frequency representation of the sound. The spectrogram is then flattened using a transposed 1-dimensional convolution, set up to account for the window size used to create the spectrogram and the number of frequency channels it contains. The flattened, upsampled output is then injected into the U-Net as additional channels to improve the quality of the generated audio.
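The upsampling step can be pictured with a single frequency channel. The sketch below implements a plain transposed 1-dimensional convolution by hand; when the kernel length equals the stride (taken here to be the hop between spectrogram frames, an assumption of this sketch), every frame is spread over exactly one hop of output samples. The numbers are illustrative.

```python
def transposed_conv1d(frames, kernel, stride):
    """Each input value adds a scaled copy of the kernel to the output."""
    out = [0.0] * ((len(frames) - 1) * stride + len(kernel))
    for i, value in enumerate(frames):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


HOP = 4  # hypothetical hop length between spectrogram frames
one_band = [1.0, 2.0, 3.0]  # three frames of a single frequency channel
upsampled = transposed_conv1d(one_band, kernel=[1.0] * HOP, stride=HOP)
```

Three frames become twelve samples, so the result has the same length as the waveform it describes and can be stacked with it as extra input channels.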

This approach not only restores high frequencies and produces high-quality waveforms that are out of reach for many other models, but also gives more control over generation (just look at the closely related Stable Diffusion).

Listen (really good music).

▍ Conclusion

Creating high-quality sound efficiently is a challenge: a huge amount of data has to be generated to reproduce sound waves accurately, especially when aiming for high-quality 48 kHz stereo. Diffusion neural networks currently show the greatest success in this and many other challenging areas of AI, but classical methods are still faster and easier to use.

Which is better is up to you.
