How to convert audio data to images


Treat audio processing like computer vision and use audio data in deep learning models.

Close your eyes and listen to the sounds around you. Whether you’re in a crowded office, a cozy home, or outdoors, you can tell where you are from what you hear. Hearing is one of the five basic human senses, and sound plays an important role in our lives. Organizing and leveraging audio data through deep learning is therefore an important step for AI to understand our world. A key challenge in audio processing is enabling computers to distinguish one sound from another. This capability lets computers perform a wide variety of tasks, from detecting metal wear in power plants to monitoring and optimizing vehicle fuel economy.

Today, to mark the start of a new machine learning course, I am sharing an article whose authors, as an example, identify bird species by their song. They find fragments of birdsong in recordings made in the wild and classify the species. By converting audio data into image data and applying computer vision models, the authors earned a silver medal (top 2%) in the Kaggle Cornell Birdcall Identification competition.

Processing audio as images

When doctors diagnose heart disease, they either listen directly to the patient’s heartbeat or look at an ECG, a diagram describing the patient’s heartbeat. The first approach usually takes longer: listening to the heart takes time, and more effort goes into remembering what was heard. In contrast, a visual ECG lets the clinician take in spatial information at a glance and speeds up the task.

The same reasoning applies to sound detection problems. Consider the spectrograms of four bird species. You can listen to the original audio snippets here. By eye, a person instantly sees the differences between the species in color and shape.

Stepping through sound waves in time requires more computational resources, and we can get more information from 2D image data than from 1D waves. In addition, the recent rapid progress in computer vision, especially with convolutional neural networks, can significantly improve results when audio is treated as images, as we (and almost everyone else) did in the competition.

Understanding the spectrogram

The specific representation we use is called a spectrogram: a visual representation of how the frequency spectrum of a signal varies over time. Sounds can be represented as waves, and waves have two important properties: frequency and amplitude, as shown in the figure. The frequency determines the pitch of the sound, and the amplitude determines its loudness.

Explanation of the parameters of sound waves
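
As a minimal sketch of these two properties, the snippet below synthesizes two pure tones in plain Python. The helper names `sine_wave` and `count_cycles` and the chosen values (220 Hz and 440 Hz at 8000 samples per second) are illustrative assumptions, not part of the original article.

```python
import math

def sine_wave(frequency_hz, amplitude, sample_rate=8000, duration_s=1.0):
    """Return samples of a pure tone: amplitude * sin(2 * pi * f * t)."""
    n_samples = int(sample_rate * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]

def count_cycles(samples):
    """Count upward zero crossings, a rough count of completed cycles."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

# Frequency sets how many cycles fit in one second (pitch);
# amplitude sets how far the wave swings from zero (loudness).
quiet_low = sine_wave(frequency_hz=220, amplitude=0.2)
loud_high = sine_wave(frequency_hz=440, amplitude=0.8)

print(count_cycles(quiet_low), max(quiet_low))
print(count_cycles(loud_high), max(loud_high))
```

The higher-frequency tone completes about twice as many cycles in the same second, while its larger amplitude shows up only in the peak sample value.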

In the spectrogram of an audio clip, the horizontal direction represents time and the vertical direction represents frequency. Finally, the amplitude of the sound at a given frequency and moment is represented by the color of the point at the corresponding (x, y) coordinates.

Explanation of the spectrogram
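
To make the time and frequency axes concrete, here is a deliberately naive sketch that builds a spectrogram with a plain discrete Fourier transform over short frames. The function `naive_spectrogram`, the frame size, the hop length, and the test signal are all assumptions chosen for illustration; real libraries use windowing and the FFT instead.

```python
import cmath
import math

def naive_spectrogram(samples, frame_size=64, hop=32):
    """Short-time DFT power: result[freq_bin][frame] (rows = frequency, columns = time)."""
    n_bins = frame_size // 2 + 1
    starts = range(0, len(samples) - frame_size + 1, hop)
    spec = [[0.0] * len(starts) for _ in range(n_bins)]
    for t, start in enumerate(starts):
        frame = samples[start:start + frame_size]
        for k in range(n_bins):
            # Plain DFT of one frame at frequency bin k (no window, for brevity).
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            spec[k][t] = abs(coeff) ** 2  # power at (frequency bin k, frame t)
    return spec

def loudest_hz(spec, frame, sample_rate, frame_size=64):
    """Frequency in Hz of the strongest bin in one time frame."""
    best_bin = max(range(len(spec)), key=lambda k: spec[k][frame])
    return best_bin * sample_rate / frame_size

sample_rate = 8000
# Half a second at 500 Hz followed by half a second at 2000 Hz.
signal = [math.sin(2 * math.pi * 500 * n / sample_rate) for n in range(4000)]
signal += [math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(4000)]

spec = naive_spectrogram(signal)
first_hz = loudest_hz(spec, 0, sample_rate)
last_hz = loudest_hz(spec, len(spec[0]) - 1, sample_rate)
print(first_hz, last_hz)
```

Each column of `spec` is one moment in time and each row is one frequency bin, so the strongest row moves upward when the tone jumps from 500 Hz to 2000 Hz.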

To understand how frequencies are reflected in spectrograms, look at a 3D visualization that shows amplitude using an extra dimension. The x-axis is time and the y-axis is frequency. The z-axis is the amplitude of the sound at the frequency given by the y-coordinate and the moment given by the x-coordinate. As the z value increases, the color changes from blue to red, producing the colors we saw in the previous 2D spectrogram example.

3D Spectrogram Visualization

Spectrograms are useful because they show exactly the information we need: the frequencies, the features that shape the sound we hear. Different bird species, like all sound-making objects, have their own characteristic frequency ranges, which is why they sound different. Our model simply has to learn to distinguish between frequencies in order to achieve ideal classification results.

Mel scale and its spectrogram

The mel is a psychophysical unit of pitch, used mainly in musical acoustics. The name comes from the word “melody”. The scale is derived from statistical processing of a large amount of data on how listeners subjectively perceive the pitch of tones.

The human ear, however, does not perceive differences in all frequency ranges in the same way. As frequencies increase, it becomes more difficult for us to distinguish between them. To better mimic the behavior of the human ear in deep learning models, we measure frequencies on the mel scale. On the mel scale, equal distances between frequencies sound equally far apart to the human ear. Mels (m) are related to hertz (f) by the following equation:

$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right). $
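
The equation above can be sketched directly in Python. The function names `hz_to_mel` and `mel_to_hz` are chosen here for illustration; the inverse is obtained by solving the same equation for f. Note that librosa's own conversion uses a slightly different (Slaney) variant by default and only matches this formula when called with `htk=True`.

```python
import math

def hz_to_mel(f_hz):
    """Convert hertz to mels with m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10 ** (m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 100 Hz gap covers far fewer mels at high frequencies than at low ones.
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(8100) - hz_to_mel(8000)
print(low_gap, high_gap)
```

The shrinking gap is the formula's way of saying that two high tones 100 Hz apart sound closer together than two low tones 100 Hz apart.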

A mel-scale spectrogram is simply a spectrogram with frequencies measured in mels.
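
One way to picture the frequency axis of such a spectrogram is to place band edges at equal steps in mels and convert them back to hertz. The sketch below does this under the formula given earlier; `mel_band_edges`, the band count of 8, and the 0 to 16000 Hz range are illustrative assumptions, and real mel filterbanks add overlapping triangular filters on top of these edges.

```python
import math

def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """f = 700 * (10 ** (m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min_hz, f_max_hz):
    """Return n_bands + 1 edge frequencies in Hz, evenly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]

edges = mel_band_edges(n_bands=8, f_min_hz=0.0, f_max_hz=16000.0)
# Width of each band in Hz; the bands are equally wide in mels.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print([round(w) for w in widths])
```

The bands are narrow at low frequencies and grow wider toward the top of the range, which gives low frequencies finer resolution, in line with human hearing.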

How do we use the spectrogram?

To create a mel spectrogram from sound waves, we will use the librosa library.

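import numpy as np
import librosa

# Load the clip as a mono waveform resampled to 32 kHz
y, sr = librosa.load('img-tony/amered.wav', sr=32000, mono=True)
# Compute a power mel spectrogram with 128 mel bands
melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
# Convert power to decibels
melspec = librosa.power_to_db(melspec).astype(np.float32)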

Here y is the raw waveform, sr is the sampling rate of the audio, and n_mels sets the number of mel bands in the generated spectrogram. When calling melspectrogram, you can also set the fmin and fmax parameters to limit the frequency range. The resulting mel spectrogram expresses amplitude on a squared (power) scale; the power_to_db function then converts it to decibels.
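
For intuition about the decibel conversion, here is a simplified pure-Python sketch of the usual power-to-decibel formula, 10 * log10(power / ref). The function `power_to_db_sketch` is a hypothetical stand-in written for this example, not librosa's implementation; its default values mirror the defaults documented for `librosa.power_to_db`.

```python
import math

def power_to_db_sketch(power_values, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert power values to decibels relative to ref, limited to a top_db range."""
    ref_db = 10.0 * math.log10(max(amin, ref))
    # amin guards against log10(0) for silent bins.
    db = [10.0 * math.log10(max(amin, p)) - ref_db for p in power_values]
    # Clip very quiet values so the result spans at most top_db decibels.
    floor = max(db) - top_db
    return [max(value, floor) for value in db]

db_values = power_to_db_sketch([1.0, 0.1, 0.001, 0.0])
print(db_values)  # approximately [0, -10, -30, -80]
```

Every tenfold drop in power lowers the value by 10 dB, which compresses the huge dynamic range of audio into a scale that works well as image intensity.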

To visualize the generated spectrogram, run the code:

import librosa.display
import matplotlib.pyplot as plt
librosa.display.specshow(melspec, x_axis="time", y_axis="mel", sr=sr, fmax=16000)
plt.show()  # display the figure when running outside a notebook

Alternatively, if you are using a GPU, you can speed up the spectrogram generation process using the torchlibrosa library.

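import torch
from torchlibrosa.stft import Spectrogram, LogmelFilterBank

# torchlibrosa expects a float tensor of shape (batch_size, n_samples)
waveform = torch.from_numpy(y).float().unsqueeze(0)
spectrogram_extractor = Spectrogram()
logmel_extractor = LogmelFilterBank(sr=sr, n_mels=128)
spec = spectrogram_extractor(waveform)  # (batch_size, 1, time_steps, freq_bins)
logmel = logmel_extractor(spec)  # (batch_size, 1, time_steps, mel_bins)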


To conclude, we can take advantage of the latest advances in computer vision in sound applications by converting audio clips into images. We achieve this with spectrograms, which show the frequency, amplitude, and time information of the audio data in a single image. Using the mel scale and the mel-scale spectrogram helps computers mimic human hearing when distinguishing sounds of different frequencies. To generate spectrograms in Python, we can use librosa, or torchlibrosa for GPU acceleration. By looking at sound problems in this way, we can create effective deep learning models for identifying and classifying sounds, just as doctors diagnose heart disease with ECGs.

