Generating Music with Stable Diffusion

Many people have already heard of, or even tried, the Stable Diffusion model for generating images from text.

Now the same model can be used to generate music! The model was fine-tuned to generate spectrograms from a text prompt, and now it is possible to do the following:

The key point is that the resulting spectrogram can easily be converted into an audio clip.


Wow! Is that really possible? Yes!

This is version 1.5 of the Stable Diffusion model. We get a spectrogram from the input prompt, and then the code converts the spectrogram into sound. Moreover, you can generate endless variations of a sound by changing the random seed. And yes, all the same techniques work as in Stable Diffusion: inpainting, negative prompts, img2img.

About spectrograms

An audio spectrogram is a visual way of representing the frequency content of an audio clip. The x-axis represents time and the y-axis represents frequency. The color of each pixel encodes the amplitude of the sound at that frequency and time.

A spectrogram can be obtained from sound using the short-time Fourier transform (STFT), which approximates sound as a combination of sine waves of varying amplitude and phase.
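To make the axes concrete, here is a minimal sketch of a magnitude spectrogram computed with a naive windowed DFT, using only the Python standard library. The function name `magnitude_spectrogram`, the frame size, the hop and the test tone are illustrative assumptions, not the settings used by the model's own preprocessing.

```python
import cmath
import math

def magnitude_spectrogram(signal, n_fft=64, hop=16):
    """Return a list of frames; each frame holds n_fft // 2 + 1 magnitudes."""
    # Periodic Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        # Naive DFT of one windowed frame, non-negative frequencies only.
        frame = []
        for k in range(n_fft // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            frame.append(abs(acc))  # keep the amplitude, drop the phase
        frames.append(frame)
    return frames

sample_rate = 8000  # assumed sample rate for this toy example
n_fft = 64
tone_hz = 1000      # falls exactly on a bin: 8000 / 64 = 125 Hz per bin
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(512)]
spec = magnitude_spectrogram(signal, n_fft=n_fft, hop=16)

# Frame index maps to time, bin index maps to frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / n_fft
```

Each entry `spec[t][k]` is one "pixel": `t` is the time frame and `k` is the frequency bin, so a pure tone shows up as a single bright horizontal line.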

The STFT is invertible, so the original sound can be reconstructed from the spectrogram. However, the spectrogram images from the Riffusion model only show the amplitude of the sine waves, not the phases, because the phases are chaotic and difficult to predict. Instead, the Griffin-Lim algorithm is used to approximate the phase when reconstructing the audio.

Below, for intuition, is a hand-drawn image that can be converted to audio.
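The idea of phase approximation can be illustrated with a small, pure-Python sketch of Griffin-Lim: start from random phases, then alternate between going back to audio and re-imposing the known magnitudes. Everything here (the helper names, the tiny frame size, the iteration count) is an assumption chosen so the example runs quickly; in practice a library implementation such as `torchaudio.transforms.GriffinLim` would be used instead.

```python
import cmath
import math
import random

N_FFT, HOP = 32, 8
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / N_FFT) for n in range(N_FFT)]
# Precomputed DFT factors exp(-2*pi*i*m/N) to keep the naive transforms fast.
TWIDDLE = [cmath.exp(-2j * math.pi * m / N_FFT) for m in range(N_FFT)]

def stft(signal):
    """Complex STFT: a list of frames, each with N_FFT bins."""
    frames = []
    for start in range(0, len(signal) - N_FFT + 1, HOP):
        chunk = [signal[start + n] * WINDOW[n] for n in range(N_FFT)]
        frames.append([sum(chunk[n] * TWIDDLE[(k * n) % N_FFT] for n in range(N_FFT))
                       for k in range(N_FFT)])
    return frames

def istft(frames, length):
    """Windowed overlap-add inverse, normalized by the summed squared window."""
    out, norm = [0.0] * length, [0.0] * length
    for i, frame in enumerate(frames):
        for n in range(N_FFT):
            # Inverse DFT of one sample; only the real part is kept.
            value = sum(frame[k] * TWIDDLE[(-k * n) % N_FFT]
                        for k in range(N_FFT)).real / N_FFT
            out[i * HOP + n] += value * WINDOW[n]
            norm[i * HOP + n] += WINDOW[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]

def griffin_lim(magnitudes, length, n_iter=20, seed=0):
    """Estimate audio whose STFT magnitudes match `magnitudes`."""
    rng = random.Random(seed)
    # Start with the known magnitudes and random phases.
    spec = [[m * cmath.exp(2j * math.pi * rng.random()) for m in frame]
            for frame in magnitudes]
    for _ in range(n_iter):
        estimate = stft(istft(spec, length))
        # Keep the estimated phase, restore the known magnitude.
        spec = [[m * (e / abs(e)) if abs(e) > 1e-12 else m
                 for m, e in zip(mf, ef)]
                for mf, ef in zip(magnitudes, estimate)]
    return istft(spec, length)

def spectral_error(audio, magnitudes):
    """Squared distance between the audio's STFT magnitudes and the target."""
    return sum((abs(e) - m) ** 2
               for ef, mf in zip(stft(audio), magnitudes)
               for e, m in zip(ef, mf))

signal = [math.sin(2 * math.pi * 5 * n / 32) for n in range(128)]
target = [[abs(v) for v in frame] for frame in stft(signal)]  # phase discarded
rough = griffin_lim(target, len(signal), n_iter=0)     # random phase only
refined = griffin_lim(target, len(signal), n_iter=20)  # after refinement
```

Comparing `spectral_error(rough, target)` with `spectral_error(refined, target)` shows how the iterations pull the reconstructed audio toward the magnitudes in the image, even though the true phases were never available.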

For audio processing, torchaudio is used, as it has excellent support for processing on the GPU. The audio preprocessing code can be found here.


With diffusion models, you can generate results not only based on text, but also based on images. This is incredibly useful for modifying sounds while maintaining the structure of the original clip that you like. You can control how much to deviate from the original clip towards the new result using the denoising parameter.
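As a rough sketch of what the denoising parameter controls: in common img2img implementations the strength decides how far along the noise schedule the original image is pushed, and therefore how many denoising steps are actually run. The function below follows that general convention; the name `img2img_schedule` is hypothetical, and whether the Riffusion pipeline computes it exactly this way is an assumption.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_step, steps_to_run) for a given denoising strength.

    strength = 0.0 keeps the original clip untouched (no steps are run);
    strength = 1.0 ignores it and runs the full schedule from pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_step = num_inference_steps - steps_to_run
    return start_step, steps_to_run

# With 50 scheduled steps, a strength of 0.5 skips the first 25 steps
# and denoises for the remaining 25.
half = img2img_schedule(50, 0.5)
```

Lower values stay close to the structure of the original clip, higher values give the prompt more freedom to change it.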

Here is an example of converting a rock and roll solo to violin.

rock and roll electric guitar solo

acoustic folk fiddle solo

Interpolation and generation of long audio

Creating short clips is cool, but it would be even better to generate endless audio in a similar way.

Let’s say we created 100 clips with different random seeds from one prompt. We can’t simply join the resulting clips because they differ in key and tempo.

The strategy is to create a large number of variations of the same initial image using img2img with different random seeds and prompts. This preserves the basic properties of the audio.

However, even with this approach, the transition between clips will be too abrupt. Even the same prompt with a different random seed will produce a melody with a different motif and mood.

To solve this problem, you can smoothly interpolate between prompts and random seeds in the model's latent space. In diffusion models, the latent space is a space of feature vectors that covers all variations of what the model can generate. Items that are similar to each other lie close together in the latent space, and every point in the latent space can be decoded into some meaningful output of the model.
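The interpolation idea can be sketched in a few lines. Spherical linear interpolation (slerp) is a common choice for moving between two Gaussian noise vectors, and plain linear interpolation is a common choice for prompt embeddings; using them here is an assumption, since the post does not say which scheme the authors chose. The helpers `slerp`, `lerp` and `seed_noise` and the tiny vector size are hypothetical and only stand in for the real latents.

```python
import math
import random

def lerp(t, v0, v1):
    """Linear interpolation, e.g. between two prompt embeddings."""
    return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

def slerp(t, v0, v1):
    """Spherical interpolation, e.g. between the noise of two seeds."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    theta = math.acos(max(-1.0, min(1.0, dot / (norm0 * norm1))))
    sin_theta = math.sin(theta)
    if abs(sin_theta) < 1e-6:
        # Vectors are (anti)parallel: fall back to linear interpolation.
        return lerp(t, v0, v1)
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

def seed_noise(seed, size=8):
    """Stand-in for the initial latent noise produced by a random seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

noise_a, noise_b = seed_noise(1), seed_noise(2)
# Five evenly spaced latents: each would be decoded into one clip.
frames = [slerp(i / 4, noise_a, noise_b) for i in range(5)]
```

Decoding neighbouring entries of `frames` gives clips that differ only slightly, which is what makes the transitions sound smooth.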

The trick is that you can sample the latent space between two different random seeds of the same prompt, or between two different prompts with the same random seed. Here is an example with an image generation model:

The same can be done with audio: even between clips that seem very different at first glance, you can get very smooth transitions.

The picture below shows interpolation in latent space between two random seeds of the same prompt. In this way, we achieve much smoother playback of the sequence of sounds. Each clip we generate can have its own mood and its own motif, but the interpolation result is impressive:

Interpolation between typing and jazz is really something!

The Hugging Face library has great material on diffusion models for img2img and for interpolation between prompts. But the authors provided their own code for the audio generation task, which also supports masking: the ability to generate and change only part of the spectrogram. The code

Testing the model

The authors have built a convenient web interface with Three.js, React and Tailwind, where anyone can try generating something of their own.

Link to web application

If you want to generate something more complex, the authors provide all the source code, and the model can even be run on Google Colab.

More examples for the Riffusion model can be found in my Telegram channel. I write about ML, startups and UK relocation for IT professionals.
