Lyrics in IT, or How We Learned to Extract Lyrics from Songs: The Sound Experience

For music streaming users to easily find songs by theme and meaning, and for the recommendation system to suggest the most suitable tracks, we need a lyrics extraction process: automated transcription and subsequent analysis of song lyrics, from romantic ballads to disco hits. It also lets us filter content effectively for different age groups.

My name is Dmitry Berestnev, I am Chief Data Scientist at the HiFi streaming service Sound, and today I’ll tell you how we implemented lyrics recognition.

The essence of the problem

According to statistics, only about 10% of labels and musicians upload song lyrics (along with the tracks themselves) to streaming services, so the lyrics have to be transcribed on our side. However, there are no ready-made open-source solutions for Automatic Lyrics Transcription on the market, so we at Sound decided to develop our own.

We immediately ran into problems: for full-scale model training we lacked high-quality labeled data, lyrics, and timecodes. There were no open-source datasets or ready-made neural networks for automatic lyrics transcription, since this task is niche and low-priority for speech recognition specialists. We therefore abandoned the idea of training a model ourselves and began testing open-source speech2text solutions, applying various settings to improve their performance in our context.

Description of the task

We needed to create a pipeline for sequential processing of a large number of audio files in batch mode, saving the results to internal data storage. It was important that the solution work effectively with both Russian and English.

To evaluate the accuracy of the model, we chose Mean Opinion Score (MOS) as the target metric; it is determined by expert assessment on a scale from 1 to 5. The quality of the resulting transcription should be high enough to understand the meaning of the song: even if individual words are recognized incorrectly, the overall meaning of the text should remain clear.

To assess the quality of the system objectively, we decided to use a proxy metric: Word Error Rate (WER), the percentage of incorrectly recognized words. Current SoTA speech2text models achieve a WER in the range of 5% to 10%. In the most recent Automatic Lyrics Transcription competition, the best WER results for English ranged from 11.5% to 24.3% across datasets. Based on these figures, we set our success threshold at 30%, which is realistic and achievable in our conditions.
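For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn the recognized text into the reference lyrics, divided by the number of reference words. A minimal sketch of how it can be computed with the open-source jiwer library (our evaluation code itself is internal, and the example strings are made up):

```python
import jiwer  # pip install jiwer

reference = "last christmas i gave you my heart"
hypothesis = "last christmas i gave you my hat"

# (substitutions + deletions + insertions) / number of reference words
wer = jiwer.wer(reference, hypothesis)
print(f"WER = {wer:.1%}")  # one of seven words is wrong -> ~14.3%
```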

In addition, the target speed was no more than 8 seconds of inference per song, where one song is a 240-second audio file, i.e. the required RTF = 8 / 240 ≈ 0.033. This requirement reflects the need for fast and efficient analysis of songs, so that integrating the model into our processes is as productive as possible.
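In other words, RTF here is simply inference time divided by audio duration; a trivial helper makes the target explicit:

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of inference per second of audio."""
    return inference_seconds / audio_seconds

print(rtf(8, 240))  # 0.0333... -- our upper bound for a 240-second song
```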

What we did

To implement the pipeline, we needed Airflow to orchestrate the process, Python with PyTorch to run the models and pipelines, some S3 to store intermediate results, and a powerful GPU so as not to wait a hundred years for inference results. The pipeline consists of several stages: data preprocessing, model application, and postprocessing, which produces the extracted text. All stages are implemented as an Airflow pipeline operating in batch mode. The resulting texts are uploaded to files in Hive, which allows other teams, such as the recommendations team, to use them to create features and integrate them into their models.
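To give an idea of the shape of this orchestration, here is a minimal Airflow sketch with the three stages chained one after another; the DAG id, task names, and callables are illustrative placeholders, not our production code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess(**_):
    ...  # source separation and VAD on the downloaded batch


def run_inference(**_):
    ...  # speech2text on the cleaned vocals


def postprocess(**_):
    ...  # error correction and upload of the texts to storage


with DAG(
    dag_id="lyrics_extraction",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # assumes Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t2 = PythonOperator(task_id="inference", python_callable=run_inference)
    t3 = PythonOperator(task_id="postprocess", python_callable=postprocess)
    t1 >> t2 >> t3
```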

To apply speech2text models to songs optimally, we decided to adapt the pipeline to the specific features of musical audio. An important difference between songs and ordinary speech is the presence of noise (accompaniment) behind the vocals and between phrases, as well as possibly slurred pronunciation of words (thanks, mumble rap). Accordingly, before a ready-made speech2text model can be applied to an audio track, the track has to be preprocessed.

To deal with noise behind and between the vocals, we turned to the class of source separation models, which split an audio track into separate elements (stems). For this we used a popular model, HTDemucs, which separates out vocals, bass, drums, and other elements. We were interested in the vocal part. We tested other models, but they were inferior to HTDemucs in quality. With its help, we got rid of 99% of the noise.
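A sketch of how the vocal stem can be extracted with the demucs Python API; the file names are illustrative, and in production this step runs inside the batch pipeline rather than as a standalone script:

```python
import torch
import torchaudio
from demucs.apply import apply_model
from demucs.pretrained import get_model

model = get_model("htdemucs").eval()                 # pretrained Hybrid Transformer Demucs
wav, sr = torchaudio.load("last_christmas.mp3")      # (channels, time)
wav = torchaudio.functional.resample(wav, sr, model.samplerate)

with torch.no_grad():
    # apply_model expects a batch dimension and returns (stems, channels, time)
    stems = apply_model(model, wav[None], device="cuda")[0]

vocals = stems[model.sources.index("vocals")]        # keep only the vocal stem
torchaudio.save("vocals.wav", vocals.cpu(), model.samplerate)
```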

Applying the HTDemucs model to a fragment of “Wham! – Last Christmas” (before / after audio examples)

The next step was to segment the audio track, a task more familiar to the speech recognition community. We used VAD (Voice Activity Detection) models, which determine at which points in time an audio track contains speech and at which it does not. After experimenting with models from Silero and Pyannote, we settled on the latter. As a result, by applying these two models sequentially, we obtained clean vocals without pauses, ready to be passed to the speech2text model.
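A sketch of the VAD step with pyannote.audio (the pretrained pipeline is gated on Hugging Face, so an access token is required; the file name continues the example above):

```python
from pyannote.audio import Pipeline

vad = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="hf_...",                 # your Hugging Face token
)

speech = vad("vocals.wav")                   # annotation of speech regions
for segment in speech.get_timeline().support():
    print(f"singing from {segment.start:.1f}s to {segment.end:.1f}s")
```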

Applying VAD to a fragment of “Wham! – Last Christmas” (before / after audio examples)

For the speech2text task, we tested open-source solutions from OpenAI (different versions of Whisper) and Nvidia (NeMo Conformer). We quickly realized that small architectures weren't suitable (thanks, mumble rap, again), so we tested the larger Whisper v2 and v3 configurations from Hugging Face. We tuned hyperparameters to achieve the best transcription quality, measured by WER, experimenting with the window parameters, beam search, and temperature.
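A sketch of running Whisper large-v3 through the Hugging Face transformers pipeline with the kind of decoding knobs we tuned; the specific values below are illustrative rather than our final configuration:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    chunk_length_s=30,                  # sliding-window size over the vocal track
)

result = asr(
    "vocals.wav",
    return_timestamps=True,
    generate_kwargs={
        "language": "en",               # the pre-detected language of the track
        "num_beams": 5,                 # beam search width
        "temperature": 0.0,             # deterministic decoding; raised as a fallback
    },
)
print(result["text"])
```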

Whisper Architecture

After receiving the results in the form of recognized text, we tested error correctors based on both models and regular expressions. For example, we used Google's T5 model for text understanding and generation.
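A heavily simplified sketch of such a seq2seq corrector; the vanilla t5-base checkpoint and the task prefix are placeholders, since in practice the model would need to be fine-tuned on pairs of noisy transcripts and clean lyrics before it corrects anything useful:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint: a real corrector would be a T5-family model
# fine-tuned on (noisy transcript, clean lyrics) pairs.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

noisy = "last crismas i gave you my hart"
inputs = tokenizer("fix: " + noisy, return_tensors="pt")   # hypothetical task prefix
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```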

We experimented with different combinations and settings of these solutions, which allowed us to assemble the best pipeline: HTDemucs, Pyannote VAD, and Whisper v3 with the right settings, plus an error corrector.

It is important to note that we focused on songs in Russian and English. Whisper can recognize the language (the prediction appears in the first decoder token), so we ran our song database in advance through a lighter version of Whisper with a single decoder step to determine each song's language. At inference time, we then passed the stored language of the incoming audio track to the model.
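A sketch of that pre-detection step with the openai-whisper package, which exposes the language probabilities directly; a small checkpoint is enough because only the first decoder step is needed:

```python
import whisper

model = whisper.load_model("base")                    # a light checkpoint is enough here
audio = whisper.pad_or_trim(whisper.load_audio("vocals.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)                 # one decoder step, as described above
language = max(probs, key=probs.get)                  # e.g. "en" or "ru"
print(language)
```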

The process of inference

Our Airflow has a Directed Acyclic Graph (DAG) configured to coordinate the process of extracting lyrics from audio tracks. This DAG performs several operations: it collects the IDs of songs whose lyrics need to be extracted and loads them into a volume designed for this purpose. The volume is attached to a cluster equipped with Nvidia A100 GPUs. To optimize resources, each GPU is divided into several isolated instances (MIG, Multi-Instance GPU), which allows the load to be distributed more efficiently between different inference tasks.

Then, within the DAG, the inference process is run, which uses an ensemble of our models to process the downloaded songs. This allows us to process large amounts of audio data in parallel and efficiently, minimizing latency and maximizing performance.

Once the inference process is complete, the results (the lyrics) are automatically uploaded to S3 storage. This ensures the data is stored reliably and remains available for further processing. From S3 the data is transferred to Hive, where it becomes available for analysis and use by other teams within the company to create features or feed recommendation systems.

Specifics of music

Based on experiments with 5,000 songs (2,000 hours of audio) in various genres, we found that the best word recognition results are achieved in the Pop, New Wave, Pop Rock, House, Disco and Chanson genres, where the WER is from 10% to 20%. This is because in these genres the music is unobtrusive, and the lyrics are clear and understandable, which makes it easy to sing along. The Non-music category also showed good results (which is not surprising), including podcasts, audio plays and radio shows.

The least effective word recognition was in the Techno, Electro, DnB and Metal genres, where the WER exceeds 40%. These genres are characterized by aggressive and heavy music, as well as various voice effects, which makes it difficult for even a human to understand the words.

In addition, English is traditionally recognized better than Russian, which is due to the characteristics of the models themselves. For bilingual songs, where several languages are present at once, we used an error corrector and post-processing, which significantly improved the recognition results.

Result

As a result, we adapted and modernized existing open-source solutions for working with music and created our own ML service deployed in our internal Kubernetes cluster. This service analyzes music files in batch processing mode, extracts lyrics from them, and uploads the results to an internal data storage.

We achieved strong results: recognition accuracy is characterized by Mean WER = 24.5% and MOS = 3.665, with RTF = 0.0125 (almost three times lower than required: 80 seconds of audio processed per second of inference). The Sound catalog today contains more than 70 million tracks, about half of which contain vocals. We have already labeled a third of the songs, but about 150 thousand new tracks are added every week. So we still have at least 22 million more compositions to process, which we plan to do in the near future.

Next, we plan to optimize the lyric extraction process, increasing the processing speed and improving the quality of text recognition for the most complex categories of music content, such as Metal. In addition, thanks to the additionally collected texts, we will refine and rebuild our recommendation algorithms to improve their quality.

We are also developing a situational search feature, allowing users to search for content related to specific queries, such as a soundtrack for a party or a vacation. Each track will be provided with a list of tags, helping to relate it to a specific situation.

Thank you for reading! We welcome your questions.

PS: Special thanks to the ML Audio team, namely Anton Nesterenko and Amantur Amatov, for the work done and help with the article.
