Ways to represent audio in ML

This article discusses the main ways of representing audio for further use in various areas of data processing.

The mainstream of recent years in DS/ML has been NLP, in particular neural networks built on the transformer architecture. They are also used in voice assistants, which have become firmly established in our lives. An important component of the success of voice assistants, however, is that they are “voice”: they are accessed through speech, which means audio. Work with an audio signal is often performed by analyzing both the sound itself and the image of its spectrogram, but this article looks at ways of representing audio as a set of different features. The Python libraries librosa and matplotlib are used throughout. The main source audio file is the chest-opening jingle from The Legend of Zelda: Breath of the Wild, a wav file about one second long. The information presented here can be applied to speech-to-text, sound classification and other areas of audio analysis.

Importing the libraries and loading the file:

import numpy as np                 # needed later for the spectrogram
import librosa
import librosa.display
import matplotlib.pyplot as plt

file = "zelda_chest_open.wav"      # placeholder path to the ~1 s source clip
y, sr = librosa.load(file)

Here, “y” is a NumPy array containing the audio signal in numerical form, and “sr” is the sample rate, that is, the number of samples per second.
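Before plotting anything, it can be useful to check what load() actually returned. A minimal sketch (the exact numbers depend on the file):

print(type(y), y.shape)                   # 1-D NumPy array of samples
print(sr)                                 # 22050 by default: librosa resamples on load
print(librosa.get_duration(y=y, sr=sr))   # duration in seconds, ~1 s for this clip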

With this data, it is already possible to represent the audio as a time series, with signal amplitude on the Y-axis. Note that, by default, librosa.load() reads the file in mono. Here is the code to render this view:

fig, ax = plt.subplots(nrows=1, sharex=True, sharey=True)
librosa.display.waveplot(y, sr=sr, ax=ax)  # in librosa >= 0.10 this function is librosa.display.waveshow
ax.set(title="Monophonic")
ax.label_outer()

plt.show()
Figure 1 – audio signal as amplitude versus time

In Figure 1 you can see several “bumps”. If you listen to the source file, it becomes obvious that the number of “bumps” corresponds to the number of notes in the melody (seven), although the second note is “blurred”, which is noticeable on the graph. The accented notes (the first, third and seventh) sound sharper and louder, and this is reflected in the graph by a larger signal amplitude.
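As a rough cross-check of this observation, librosa’s onset detector can be asked to count note onsets in the clip. A small sketch (the detector is heuristic, so the count may be off by one or two):

onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
print(len(onsets), onsets)   # expected to be close to the seven notes heard in the melody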

Now let’s look at a more interesting representation of the audio. Here is the code:

S = np.abs(librosa.stft(y))
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(S,
                                                       ref=np.max),
                               y_axis="log", x_axis="time", ax=ax)
ax.set_title('Power spectrogram')
fig.colorbar(img, ax=ax, format="%+2.0f dB")
plt.show()
Figure 2 – spectrogram of the original audio file

A bit of theory to understand the spectrogram shown in Figure 2.

The Fourier transform decomposes a function of time (a signal) into its frequency components. In the same way that a musical chord can be expressed by the volume and frequencies of its constituent notes, the Fourier transform applied to a signal gives the amplitude of each frequency present in it. This is exactly what we see in Figure 2. It has three scales: time and frequency, which are intuitive, and an additional decibel scale that shows the signal power at a given frequency in a given period of time. There is an alternative 3D representation that can help in understanding spectrograms. It is shown in Figure 3:

Figure 3 – three-dimensional representation of a fragment of a piece of music

So, if you stack all the “slices” of the three-dimensional representation and encode the “saturation” of each area with color, you get a two-dimensional color representation like the one in Figure 2.

Let’s summarize the above: the spectrogram obtained by applying the Fourier transform to the audio signal lets you “see” how strongly different frequencies are represented in the signal.
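To make the “chord” analogy concrete, here is a small self-contained sketch (pure NumPy, independent of the source clip) that builds a signal from three sine waves and recovers their amplitudes with the Fourier transform:

sr_demo = 22050                                   # sampling rate for the synthetic signal
t = np.linspace(0, 1.0, sr_demo, endpoint=False)  # one second of time stamps
chord = (1.00 * np.sin(2 * np.pi * 220 * t)       # three "notes" with different volumes
         + 0.50 * np.sin(2 * np.pi * 440 * t)
         + 0.25 * np.sin(2 * np.pi * 660 * t))

spectrum = 2 * np.abs(np.fft.rfft(chord)) / len(chord)  # amplitude of each frequency component
freqs = np.fft.rfftfreq(len(chord), d=1 / sr_demo)
for f in (220, 440, 660):
    idx = np.argmin(np.abs(freqs - f))
    print(f"{f} Hz -> amplitude ~{spectrum[idx]:.2f}")   # prints ~1.00, ~0.50, ~0.25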

The most illustrative comparison, in the author’s opinion, is between the spectrogram of a piece of classical music with a light sound and that of some “heavy” genre with a high density and saturation of sound. In this case, let it be Tchaikovsky’s “Dance of the Sugar Plum Fairy” (Figure 4) and Megadeth’s “Symphony of Destruction” (Figure 5).
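Figures 4 and 5 can be reproduced with the same stft() / specshow() pipeline as above. A sketch, assuming the two recordings are available locally under the hypothetical file names used below:

tracks = {
    "Dance of the Sugar Plum Fairy": "sugar_plum_fairy.wav",     # hypothetical path
    "Symphony of Destruction": "symphony_of_destruction.wav",    # hypothetical path
}
fig, axes = plt.subplots(nrows=len(tracks), figsize=(10, 6))
for ax, (title, path) in zip(axes, tracks.items()):
    y_t, sr_t = librosa.load(path)
    S_t = np.abs(librosa.stft(y_t))
    img = librosa.display.specshow(librosa.amplitude_to_db(S_t, ref=np.max),
                                   sr=sr_t, y_axis="log", x_axis="time", ax=ax)
    ax.set(title=title)
fig.colorbar(img, ax=axes, format="%+2.0f dB")
plt.show()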

The waveforms alone show that classical and metal differ at least in terms of volume: the classical piece alternates between quiet and loud passages, and the calm and loud moments are clearly visible, while the metal track sounds loud almost throughout.

The spectrograms, in turn, show that the “Dance of the Sugar Plum Fairy” has a lot of high notes and rather few low ones (the low frequencies are less filled), and that the composition is structured in a more complicated way than the standard “verse-chorus”. “Symphony of Destruction”, on the contrary, fills almost all frequencies evenly, the drums and bass guitar “heat up” the low end of the spectrogram, and the composition contains many repeating parts.

Now, armed with the knowledge above, let’s move on to probably the most popular representation of audio in machine learning: mel-frequency cepstral coefficients (MFCC).

Here, graphs alone will not explain much, so let’s go straight to the theory.

Generalizing somewhat roughly, we can say that a note is an oscillation of a certain frequency; for example, the A of the first octave (A4) is an oscillation at 440 Hz. However, human perception of sound is imperfect: the perceived pitch depends not only on the oscillation frequency, but also on the loudness of the sound and its timbre. It was for analyzing audio with the peculiarities of human hearing in mind that the psychophysical unit of pitch, the mel, was introduced. A fact from Wikipedia: “Sound vibrations with a frequency of 1000 Hz at an effective sound pressure of 2⋅10⁻³ Pa (that is, at a loudness level of 40 phon), acting from the front on a listener with normal hearing, produce a perceived pitch estimated, by definition, at 1000 mel. A sound with a frequency of 20 Hz at a loudness level of 40 phon has, by definition, zero pitch (0 mel). The dependence is non-linear, especially at low frequencies (for “low” sounds).” Figure 6 shows the mel-to-frequency scale:

Figure 6 – Graph of the dependence of the pitch in mels on the oscillation frequency (for a pure tone)
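The non-linearity of the scale is easy to see in code. A small sketch using librosa’s built-in converter (htk=True selects the classic formula under which 1000 Hz maps to roughly 1000 mel, matching the definition quoted above):

# At low frequencies the mel value grows roughly in proportion to the frequency;
# at high frequencies it grows much more slowly (logarithmically).
for hz in (110, 220, 440, 880, 1760, 3520):
    print(f"{hz:5d} Hz -> {librosa.hz_to_mel(hz, htk=True):7.1f} mel")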

Thus, to obtain a representation that reflects how a person actually perceives sound, some transformations must be performed, the result of which is the mel-frequency cepstral coefficients. It should also be noted that these transformations are arranged so that greater detail is given to the low-frequency region, the one most used by humans. After all, the human ear, on average, hears in the range of 20–20,000 Hz, while (one more fact from Wikipedia): “The voice of a typical adult male has a fundamental (lowest) frequency of 85 to 155 Hz, that of a typical adult female from 165 to 255 Hz.” The rest of the voice components are overtones. The same is true of musical instruments.
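As an intermediate step on the way to the MFCCs, librosa can build a mel spectrogram directly, i.e. the usual spectrogram projected onto a bank of mel filters that are narrow at low frequencies and wide at high ones. A sketch reusing the y and sr loaded above:

S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.power_to_db(S_mel, ref=np.max),
                               y_axis="mel", x_axis="time", ax=ax)
ax.set(title="Mel spectrogram")
fig.colorbar(img, ax=ax, format="%+2.0f dB")
plt.show()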

Let’s summarize the above: mel-frequency cepstral coefficients allow us to represent audio in a form that is closest to how a human perceives sound.

Using the librosa library, you can get the data in the desired form simply by calling the mfcc() function. The code for calling the function, along with rendering the result, is shown below; the result of running it is shown in Figure 7.

mfccs = librosa.feature.mfcc(y=y, sr=sr)

fig, ax = plt.subplots()
img = librosa.display.specshow(mfccs, x_axis="time", ax=ax)
fig.colorbar(img, ax=ax)
ax.set(title="MFCC")
plt.show()
Figure 7 – MFCC for the source file
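A short note on how these coefficients are usually consumed downstream (a sketch; the exact frame count depends on the clip length and hop size):

print(mfccs.shape)                   # (20, n_frames): 20 coefficients per time frame by default
feature_vector = mfccs.mean(axis=1)  # a common trick for classification: average over time
print(feature_vector.shape)          # (20,) - one fixed-length feature vector per clip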

Unfortunately, the visualization shown in Figure 7 is, in the author’s opinion, not as clear and informative as the spectrograms, but visualizations for “Dance of the Sugar Plum Fairy” (Figure 8) and “Symphony of Destruction” (Figure 9) are also shown below:

Figure 8 – MFCC for “Dance of the Sugar Plum Fairy”
Figure 9 – MFCC for “Symphony of Destruction”

Thus, the article has covered the most commonly used ways of representing audio in ML and in audio analysis in general. It also visually compared the spectrograms of two different pieces of music to demonstrate the differences in their feature sets. Below are the materials used for this article, recommended for a deeper dive into the topic:

How to apply machine learning and deep learning methods to audio analysis

Audio Deep Learning Made Simple (Part 1): State-of-the-Art Techniques

Audio Deep Learning Made Simple (Part 2): Why Mel Spectrograms perform better
