Classifying audio files with the Librosa library

Audio is the physical representation of sound, and the range of frequencies audible to humans runs roughly from 20 Hz to 20 kHz. Sound is stored in many formats that a computer can parse, such as mp3, wma and wav.

To work with an audio signal, it must be digitized, i.e. the sound wave has to be converted into a series of numbers. This is done by measuring the amplitude of the sound at fixed intervals. Each such measurement is called a sample, and the number of samples per second is called the sample rate. A typical sample rate is about 44,100 samples per second, which means that a 10-second audio clip contains 441,000 samples.
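
As a quick back-of-the-envelope check of that arithmetic (the numbers here are just the ones from the example above):

sample_rate = 44100  # samples per second
duration = 10        # seconds
print(sample_rate * duration)  # 441000 samples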

Thanks to sampling, a large number of different characteristics can be extracted from audio recordings to support further analysis. Among them are, for example, mel-frequency cepstral coefficients (MFCCs), the spectrum, the spectrogram, the spectral centroid and the spectral rolloff.

To understand these characteristics in more detail, let’s install the Librosa library, which is used for analyzing audio signals, but is more focused on music.

pip install librosa

or

conda install -c conda-forge librosa

And here I want to make a small digression. I work in a Jupyter Notebook, and it often happens that you type these magic commands and... nothing works. Then it starts: googling, trying various options that promise to fix the installation problem, reinstalling everything from scratch, creating a new virtual environment, and still nothing works. With this library I had a lot of problems: at different points in time various errors appeared and nothing helped. But then, in some magical way, Librosa was installed anyway, and I don’t even know what exactly did the trick.

My guess is that the command that finally worked was

pip install librosa --user

Now you can import the library and start working.

import librosa

We will classify audio files from a dataset containing a collection of audio files in 10 different genres. Each genre has 100 audio files, and features are provided for both 3-second and 30-second excerpts.

Loading Audio and Extracting Significant Characteristics

Using the function librosa.load() we can read specific sounds from our dataset.

import os  # library for working with files
dir = "***/datasets/Data/genres_original"  # set the directory with the data
file = dir + '/blues/blues.00000.wav'
signal, sr = librosa.load(file, sr = 22050)  # load the file

At the output we get two objects: the first is the digital representation of our audio signal (as a time series), the second is the sample rate at which it was loaded. By default librosa resamples to 22050 Hz. As mentioned above, the sample rate is the number of audio samples per second, measured in Hz or kHz.

print(signal.shape, sr)
(661794,) 22050
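
As a small optional sanity check, librosa can also report the duration of the clip, which is simply the number of samples divided by the sample rate:

# duration in seconds = number of samples / sample rate
print(librosa.get_duration(y=signal, sr=sr))  # ~30 s, since 661794 / 22050 ≈ 30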

Let’s look at our audio signal

print(signal)
[ 0.00732422  0.01660156  0.00762939 ... -0.05560303 -0.06106567
 -0.06417847]

The loaded audio signal can be displayed as a waveform using the function librosa.display.waveshow()

import matplotlib.pyplot as plt
import librosa.display as ld
plt.figure(figsize=(12,4))
ld.waveshow(signal, sr=sr)

The vertical axis represents the amplitude of the sound, while the horizontal axis represents time.

And with the help of IPython.display.Audio() we get a player in the notebook where we can play the audio file.

import IPython
display(IPython.display.Audio(signal, rate = sr))

Sound can be translated from the time domain to the frequency domain using the Fast Fourier Transform, which gives us the signal spectrum. Librosa has a function for this, librosa.stft(), with parameters such as n_fft, the length of the windowed signal after padding with zeros (the FFT window size), and hop_length, the number of samples between successive frames.

import numpy as np

n_fft = 2048
# spectrum of the first n_fft samples of the signal
ft = np.abs(librosa.stft(signal[:n_fft], hop_length = n_fft+1))
plt.plot(ft)
plt.title('Spectrum')
plt.xlabel('Frequency Bin')
plt.ylabel('Amplitude')
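
To see how n_fft and hop_length shape the result, here is a small optional check of the full STFT matrix for the same signal (the expected shape assumes the 661794-sample clip loaded above):

# rows = frequency bins (1 + n_fft//2), columns = frames (1 + len(signal)//hop_length)
D = librosa.stft(signal, n_fft=2048, hop_length=512)
print(D.shape)  # expected: (1025, 1293)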

A spectrogram is a visual way of representing the level or loudness of a signal at the various frequencies present in the waveform. That is, it shows the intensity of frequencies over time: time is on the x-axis, frequency is on the y-axis, and the corresponding amplitudes are represented by color.

Using the function librosa.display.specshow() you can look at the spectrogram of the audio signal:

X = librosa.stft(signal)
s = librosa.amplitude_to_db(abs(X))
ld.specshow(s, sr=sr, x_axis="time", y_axis="linear")
plt.colorbar()

Mel-frequency cepstral coefficients (MFCCs) are one of the most important features in audio processing. The way these coefficients are computed takes into account a number of properties of human hearing and models the characteristics of the human voice. This is because the sounds produced by a person are determined by the shape of the vocal tract, including the tongue, teeth, etc.

Mel-frequency cepstral coefficients can be computed using the function librosa.feature.mfcc()

mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc = 40, hop_length=512)
mfccs
mfccs.shape
(40, 1293)

n_mfcc – the number of MFCC coefficients

hop_length – the number of samples between successive frames (the hop size)

The size of the matrix of mel-cepstral coefficients is calculated as

[n_mfcc, len(signal)//hop_length + 1]

That is, if n_mfcc = 40 and hop_length = 512, then

len(signal)//hop_length + 1 = 661794//512 + 1 = 1292 + 1 = 1293.
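
We can verify this formula directly against the matrix computed above:

hop_length = 512
print(len(signal) // hop_length + 1, mfccs.shape[1])  # both should print 1293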

We can also build a mel spectrogram using the function librosa.feature.melspectrogram(). A mel spectrogram is a spectrogram converted to the mel scale.

melspectrum = librosa.feature.melspectrogram(y=signal, sr=sr,
                                             hop_length=512, n_mels=40)
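
To visualize it, the mel spectrogram is usually converted to decibels first. A short sketch using the melspectrum computed above:

# convert the power spectrogram to dB and plot it on a mel-frequency axis
mel_db = librosa.power_to_db(melspectrum, ref=np.max)
plt.figure(figsize=(12, 4))
ld.specshow(mel_db, sr=sr, hop_length=512, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")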

The spectral centroid is a good indicator of the brightness of a sound and is widely used as an automatic measure of musical timbre. In other words, the centroid shows where the center of mass of the sound is located. In blues compositions the frequencies are fairly evenly distributed, while in metal the centroid lies closer to the end of the spectrum. In librosa it is computed with the function librosa.feature.spectral_centroid()

cent = librosa.feature.spectral_centroid(y=signal, sr=sr)
plt.figure(figsize=(15,5))
plt.semilogy(cent.T, label="Spectral centroid")
plt.ylabel('Hz')
plt.legend()

array([[1936.83283904, 1820.36294357, 1780.31673025, ..., 2770.21094705, 2661.92181327, 2604.75205139]])

Spectral rolloff is the frequency below which a given percentage of the total spectral energy lies. In librosa it is computed with the function librosa.feature.spectral_rolloff()

rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr)
plt.figure(figsize=(15,5))
plt.semilogy(rolloff.T, label="Roll-off frequency")
plt.ylabel('Hz')
plt.legend()

array([[4005.17578125, 3520.67871094, 3348.41308594, ..., 5792.43164062, 5577.09960938, 5361.76757812]])

The zero crossing rate is the rate at which the signal changes sign, that is, how often it goes from positive to negative and back. For metal and rock, this value is usually higher than for other genres because of the heavy percussion. In librosa it is computed with the function librosa.feature.zero_crossing_rate()

zrate=librosa.feature.zero_crossing_rate(signal)
plt.figure(figsize=(14,5))
plt.semilogy(zrate.T, label="Fraction")
plt.ylabel('Fraction per Frame')
plt.legend()

array([[0.03808594, 0.06054688, 0.07861328, ..., 0.14550781, 0.13623047, 0.10058594]])
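
If only the total number of sign changes is needed, it can also be counted directly with librosa.zero_crossings() (an optional check):

# boolean array marking every zero crossing in the signal
zero_crossings = librosa.zero_crossings(signal)
print(zero_crossings.sum())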

Of course, this is not the full list of meaningful audio features; each researcher usually chooses which characteristics to extract from the audio files for their particular task.

For example, you can also select the averages and standard deviations of the mel-cepstral coefficients:

mfcc_mean = np.mean(librosa.feature.mfcc(y=signal, sr=sr), axis=1)
mfcc_std = np.std(librosa.feature.mfcc(y=signal, sr=sr), axis=1)

means and standard deviations of the spectral centroid

cent_mean = np.mean(cent)
cent_std = np.std(cent)

the means and standard deviations of the spectral rolloff, etc.

rolloff_mean = np.mean(rolloff)
rolloff_std = np.std(rolloff)

All these values can then be written to a dataframe and worked with:

import pandas as pd

df = pd.DataFrame(audio_data)
df['labels'] = labels

For further analysis, we need to extract features from all the audio files. For example, we can compute the average mel-frequency cepstral coefficients for every file. First we create a list audio_files with the file names of all compositions and a corresponding list labels with their genre labels:

audio_files = []
labels = []
labelind = -1
for label in os.listdir(dir):
    labelind += 1
    label_path = os.path.join(dir, label)
    for audio_file in os.listdir(label_path):
        audio_file_path = os.path.join(label_path, audio_file)
        audio_files.append(audio_file_path)
        labels.append(labelind)
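
A quick sanity check that everything was picked up (for this dataset we expect 1000 files and 10 genre labels):

print(len(audio_files), len(set(labels)))  # expected: 1000 10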

Now let’s create a function that takes an audio file as input and returns the average values of the coefficients for this file.

def preprocess_audio(audio_file_path):
    audio, sr = librosa.load(audio_file_path)
    mfcc_mean = np.mean(librosa.feature.mfcc(y=audio, sr=sr), axis=1)
    return abs(mfcc_mean)

We get a list audio_data with the numeric values for all audio files:

audio_data = []
for audio_file in audio_files:
    mfccs_mean = preprocess_audio(audio_file)
    audio_data.append(mfccs_mean)

Let’s create arrays from the characteristics of audio files and their labels

audio_data = np.array(audio_data)
labels = np.array(labels)

Thus, it is possible to combine all the characteristics of audio files into one dataframe and then work with it.

Model building

The dataset I’m working with already includes a csv file with information about the mel-frequency cepstral coefficients, spectral centroids, etc. Let’s load it and take a look.

df = pd.read_csv(f'{dir}/features_3_sec.csv')
df

Our dataframe has 2 columns (filename and length) that we won’t need further, so we remove them.

df = df.iloc[0:, 2:]

Now let’s define the training data (X) and the vector of corresponding labels (y). The training data consists of the extracted audio features, 57 values per file in total. Let’s start with the class labels.

df['label'].unique()
array(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz',
       'metal', 'pop', 'reggae', 'rock'], dtype=object)

They have a categorical representation, so we convert them to a numerical one using LabelEncoder().

from sklearn import preprocessing

class_list = df.iloc[:, -1]  # create the list of classes
convertor = preprocessing.LabelEncoder()
y = convertor.fit_transform(class_list)  # convert the labels to numbers
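
To see which number was assigned to which genre, the fitted encoder keeps the original class names (a small optional check):

# mapping from genre name to numeric label
print(dict(zip(convertor.classes_, range(len(convertor.classes_)))))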

Now let’s move on to the data for training X. Let’s remove the column with labels from our dataframe.

X = df.loc[:, df.columns !='label']

We standardize our feature matrix X using StandardScaler()

from sklearn import preprocessing
cols = X.columns
scaler = preprocessing.StandardScaler()
np_scaled = scaler.fit_transform(X)
X = pd.DataFrame(np_scaled, columns = cols)

We split the data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42 )

Let’s use classical machine learning algorithms: naive Bayes, logistic regression, k-nearest neighbors, support vector machines, the ensemble methods random forest and XGBoost (gradient-boosted trees), and the multilayer perceptron model MLPClassifier.

Import the necessary modules:

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve

Let’s create an additional function for working with learning algorithms:

def model_assess(model, title="Default"):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print('Accuracy', title, ':', round(accuracy_score(y_test, preds), 5), '\n')

We will build models and evaluate the accuracy.

# naive Bayes
nb = GaussianNB()
model_assess(nb, "Naive Bayes")
# k-nearest neighbors
knn = KNeighborsClassifier(n_neighbors=10)
model_assess(knn, "KNN")
# support vector machine
svm = SVC(decision_function_shape="ovo")
model_assess(svm, "Support Vector Machine")
# logistic regression
lg = LogisticRegression(random_state=0, solver="lbfgs", multi_class="multinomial")
model_assess(lg, "Logistic Regression")
# random forest
rforest = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0)
model_assess(rforest, "Random Forest")
# multilayer perceptron
nn = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5000, 10), random_state=1)
model_assess(nn, "Neural Nets")
# gradient-boosted trees
xgb = XGBClassifier(n_estimators=1000)
model_assess(xgb, "XGBClassifier")

We see that the lowest accuracy, 0.52, belongs to the naive Bayes algorithm, and the highest, 0.9, to XGBClassifier.
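
Since confusion_matrix was imported above, we can also look at where the best model confuses genres. A short sketch for the XGBoost model trained above:

# confusion matrix for the best-performing model
preds = xgb.predict(X_test)
cm = confusion_matrix(y_test, preds)
print(cm)  # rows: true genres, columns: predicted genres (order given by the LabelEncoder)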

Audio song recommendations

And now let’s try to recommend audio compositions to the user using cosine_similarity() from scikit-learn. This function computes the cosine similarity between two non-zero vectors, based on the cosine of the angle between them, which gives a value between -1 and 1. A value of -1 means the vectors are opposite, 0 means they are orthogonal, and 1 means they point in the same direction.
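
As a toy illustration of what cosine_similarity() returns (the vectors here are made up, not taken from the dataset):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])
c = np.array([[2.0, 0.0]])
print(cosine_similarity(a, b))  # [[0.]] orthogonal vectors
print(cosine_similarity(a, c))  # [[1.]] same direction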

To do this, take a csv file, read it and remove the extra columns.

df1 = pd.read_csv(f'{dir}/features_30_sec.csv',index_col=0)
labels = df1[['label']]
df1 = df1.drop(columns=['length','label'])

Next, we scale our dataframe, a 1000×57 matrix, and calculate the cosine similarity between its row vectors.

from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity
scaled=preprocessing.scale(df1)
similarity = cosine_similarity(scaled)

And now we represent the resulting similarity matrix as a dataframe:

similarity_labels = pd.DataFrame(similarity)
similarity_names = similarity_labels.set_index(labels.index)
similarity_names.columns = labels.index

Now, based on the resulting dataframe, we can recommend compositions by picking the names of those files whose values are close to 1.

name="rock.00087.wav"
series = pd.DataFrame(similarity_names[name].sort_values(ascending = False))
series = series.loc[(series[name]>0.90)]
series = series.drop(name)
print("\n*******\nSimilar songs to ", name)
print(series.head(5))

We get the names of the recommended compositions and their cosine similarity.
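
The same lookup can be wrapped in a small helper function (a sketch built on the similarity_names dataframe above; the 0.90 threshold is just the value used in the example):

def recommend_similar(name, top_n=5, threshold=0.90):
    # sort all tracks by similarity to the given one and keep only close matches
    series = similarity_names[name].sort_values(ascending=False)
    series = series[series > threshold].drop(name)
    return series.head(top_n)

print(recommend_similar("rock.00087.wav"))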

In the future, it would be interesting to turn audio files into images, for example their spectrograms or mel spectrograms, and then classify them using neural network algorithms. That way we could work with a visual representation of sound instead of the classical approach of converting the data into tabular features.

And finally, I want to recommend a free lesson from my colleagues at OTUS. In it, OTUS teachers will talk about recommender systems based on content filtering, and you will then apply the approaches in practice to build a recommendation system for an online store.
