Machine learning in Google’s Hum to Search
Melodies that get stuck in your head, known as earworms, are a well-known and sometimes annoying phenomenon. Once one takes hold, it can be quite difficult to get rid of. Research has shown that engaging with the original song, whether by listening to it or singing it, helps chase the haunting melody away. But what if you can’t remember the name of the song and can only hum the tune?
Existing methods for matching a hummed melody to its original polyphonic studio recording run into a number of difficulties. A live or studio recording with lyrics, backing vocals and instruments can sound very different from what we hum. In addition, by mistake or by design, our version may have a quite different pitch, key, tempo or rhythm from the song. That is why many current approaches to query by humming match the hummed melody against a database of pre-existing melody-only or hummed versions of a song, rather than identifying the song directly. However, this type of approach often relies on a limited database that requires manual updating.
Launched in October, Hum to Search is a new, fully machine-learned system in Google Search that lets a person find a song by singing or humming it. Unlike existing methods, this approach creates an embedding of the melody from the spectrogram of the song, without creating an intermediate representation. This allows the model to compare a hummed melody directly to the original (polyphonic) recording, without needing a separate hummed or MIDI version of each track or complex hand-crafted logic to extract the melody. This approach greatly simplifies the database for Hum to Search, which can be constantly refreshed with embeddings of original tracks from around the world, even the latest releases.
How it works
Many existing music recognition systems convert an audio sample into a spectrogram before processing it, in order to find a better match. However, recognizing a hummed melody poses a particular problem: it often contains relatively little information, as in a hummed example of the song “Bella ciao”. The difference between the hummed version and the same segment from the corresponding studio recording can be visualized with the spectrograms shown below:
Visualization of a hummed snippet and the matching studio recording
Given the image on the left, the model must find the audio that matches the image on the right in a collection of over 50 million similar images (corresponding to segments of studio recordings of other songs). To do this, the model must learn to focus on the dominant melody and ignore the backing vocals, instruments and timbre of the voice, as well as differences arising from background noise or room reverberation. To determine by eye the dominant melody that could be used to compare the two spectrograms, you can look for similarities in the lines at the bottom of the above images.
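For readers who have not worked with spectrograms, the idea can be sketched in a few lines. This is only an illustration using a naive discrete Fourier transform on a synthetic tone; the function name `spectrogram`, the frame size, the hop and the 8000 Hz sample rate are assumptions chosen for the example, and nothing here reflects how the Hum to Search pipeline actually computes its inputs.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frame_size // 2 + 1 magnitudes per overlapping frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window reduces leakage between neighbouring frequency bins.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT for bin k (slow, but needs no third-party library).
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames

# Synthetic input: a pure 1000 Hz tone standing in for a hummed note.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone)

# The brightest bin of the first frame corresponds to the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each inner list is one column of the image: a hummed note shows up as a single bright line, while a studio recording fills many bins at once.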
Previous efforts to implement music search, particularly for recognizing music playing in cafes or clubs, have demonstrated how machine learning can be applied to this problem. Now Playing, released in 2017 for Pixel phones, uses an on-device deep neural network to recognize songs without the need for a server connection, while Sound Search, which further developed the technology, uses it for server-based recognition to search over 100 million songs faster and more accurately. The next challenge was to apply what was learned in these releases to recognize music from a similarly large library, but from hummed or sung passages.
Setting up machine learning
The first step in developing Hum to Search was to modify the music recognition models used in Now Playing and Sound Search to work with hummed recordings. In principle, many retrieval systems of this kind (image recognition, for example) work in a similar way. During training, the neural network receives pairs of inputs (a hummed or sung melody and the original recording) and produces an embedding for each input, which will later be used for matching against a hummed melody.
Neural network training setup
To ensure recognition of what we are singing, the network must create embeddings for which audio pairs containing the same melody are located close to each other, even if they have different instrumental accompaniment and singing voices. Audio pairs containing different melodies must be far apart. During training, the network receives such audio pairs until it learns to create embeddings with this property.
Ultimately, the trained model can generate an embedding for a hummed tune that is close to the embedding of the song’s reference recording. Finding the right song is then just a matter of searching for similar embeddings in a database of embeddings computed from audio recordings of popular music.
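The lookup described above can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration, not the production system: the three-dimensional vectors, the titles "Song B" and "Song C", and the helper names `cosine_similarity` and `find_best_match` are all made up for the example. A real system would use much higher-dimensional embeddings and, most likely, an approximate index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_best_match(query, database):
    """Return (title, similarity) for the reference embedding closest to query."""
    best_title, best_score = None, -2.0  # cosine similarity is never below -1
    for title, embedding in database.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score

# Hypothetical embeddings of reference recordings (values invented for illustration).
reference_embeddings = {
    "Bella ciao": [0.9, 0.1, 0.2],
    "Song B": [0.1, 0.8, 0.3],
    "Song C": [0.2, 0.3, 0.9],
}

# Hypothetical embedding of a hummed query.
hummed_query = [0.8, 0.2, 0.1]

title, score = find_best_match(hummed_query, reference_embeddings)
print(title)
```

Because matching happens purely in embedding space, adding a new release to the database only requires computing and storing its embeddings.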
Since training the model required pairs of songs (recorded and sung), the first challenge was to get enough data. Our original dataset consisted mostly of sung snippets (very few of them contained just a wordless hum). To make the model more robust, we augmented these fragments during training, namely by randomly changing their pitch or tempo. The resulting model worked well enough when the song was sung, but not when it was hummed or whistled.
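One simple form of such augmentation can be sketched as follows. Note the swapped-in technique: this example resamples the clip by a random factor, which changes pitch and tempo together. Changing them independently, as the text describes, needs a time-stretching method such as a phase vocoder, which is beyond this sketch. The function names, the 10% range and the synthetic clip are assumptions made for illustration.

```python
import math
import random

def resample(samples, speed):
    """Resample by linear interpolation.

    speed > 1 shortens the clip (faster and higher-pitched);
    speed < 1 lengthens it (slower and lower-pitched).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / speed))
    out = []
    for i in range(out_len):
        pos = i * speed                       # fractional position in the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out

def random_speed_augment(samples, rng, max_change=0.1):
    """Apply a random speed factor in [1 - max_change, 1 + max_change]."""
    speed = rng.uniform(1.0 - max_change, 1.0 + max_change)
    return resample(samples, speed), speed

# Synthetic 0.1 s clip at 8000 Hz standing in for a sung snippet.
clip = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
rng = random.Random(0)  # fixed seed so the example is repeatable
augmented, speed = random_speed_augment(clip, rng)
```

Drawing a fresh factor each time a clip is used gives the model many slightly different versions of the same melody.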
To improve the model’s performance on wordless melodies, we generated additional simulated hum training data from the existing audio dataset using SPICE, a pitch extraction model developed by our wider team as part of the FreddieMeter project. SPICE extracts pitch values from a given audio clip, which we then use to generate a melody consisting of discrete tones. The very first version of this system transformed an original sung passage into such a sequence of tones.
Generating a hum from a sung audio clip
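The simple tone generator described above might look roughly like this. The sketch does not call SPICE: it starts from a hand-written list of per-frame pitch values in Hz standing in for the extractor's output, with `None` marking unvoiced frames. Snapping each pitch to the nearest equal-tempered semitone is one assumed way of obtaining discrete tones, and the function names, frame length and sample rate are likewise hypothetical.

```python
import math

def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def midi_to_hz(note):
    """Frequency in Hz of a MIDI note number."""
    return 440.0 * 2 ** ((note - 69) / 12)

def pitches_to_tones(pitches_hz, frame_seconds=0.1, sample_rate=8000):
    """Turn per-frame pitch estimates into a sequence of sine tones."""
    samples_per_frame = round(frame_seconds * sample_rate)
    samples = []
    phase = 0.0  # carried across frames to avoid clicks at tone boundaries
    for pitch in pitches_hz:
        if pitch is None or pitch <= 0:
            # Unvoiced frame: emit silence.
            samples.extend([0.0] * samples_per_frame)
            continue
        freq = midi_to_hz(hz_to_midi(pitch))  # snap to the nearest semitone
        step = 2 * math.pi * freq / sample_rate
        for _ in range(samples_per_frame):
            samples.append(math.sin(phase))
            phase += step
    return samples

# Hypothetical pitch track: slightly off-key A4, a pause, then roughly C4.
pitch_track = [443.0, 438.5, None, 262.0]
melody = pitches_to_tones(pitch_track)
```

The result keeps the contour of the melody while discarding lyrics and timbre, which is what makes it a usable stand-in for a wordless hum.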
We later refined our approach by replacing the simple tone generator with a neural network that generates audio resembling a real wordless hum or whistle. For instance, the sung fragment above can be converted into such a hum or whistle.
In the last step, we compared the training data by mixing and matching audio snippets. For example, when we came across similar fragments from two different performers, we aligned them using our preliminary models, which gave the model an additional pair of audio fragments of the same melody.
Improving machine learning
When training the Hum to Search model, we started with a triplet loss function. This loss has been shown to perform well on a variety of classification tasks, such as images and recorded music. Given a pair of audio clips corresponding to the same melody (points R and P in the embedding space shown below), the triplet loss ignores certain parts of the training data derived from a different melody. This helps improve learning behavior, both when the model encounters a different melody that is too easy because it is already far from R and P (see point E), and when it is too hard because, at the current stage of training, it ends up too close to R even though according to our data it represents a different melody (see point H).
Examples of audio segments rendered as points in embedding space
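To make the roles of points R, P, E and H concrete, here is a minimal sketch of a triplet loss with semi-hard negative selection. The article does not say which triplet variant was used; semi-hard mining is one common choice, picked here as an assumption because it skips both easy and hard negatives. The two-dimensional coordinates, the margin of 0.5 and the extra point S are invented for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least margin farther away than the positive."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

def is_semi_hard(anchor, positive, negative, margin=0.5):
    """True if the negative is farther than the positive, but by less than margin."""
    d_pos = distance(anchor, positive)
    d_neg = distance(anchor, negative)
    return d_pos < d_neg < d_pos + margin

# Invented 2-D embeddings matching the labels in the figure.
R = [0.0, 0.0]  # reference recording
P = [1.0, 0.0]  # same melody as R
E = [5.0, 0.0]  # easy negative: different melody, already far away
H = [0.5, 0.0]  # hard negative: different melody, closer to R than P is
S = [1.2, 0.0]  # hypothetical semi-hard negative, just beyond P

for name, negative in (("E", E), ("H", H), ("S", S)):
    used = is_semi_hard(R, P, negative)
    print(name, "used" if used else "ignored", triplet_loss(R, P, negative))
```

In this setup E contributes nothing because its loss is already zero, and H is skipped by the selection rule, so only negatives like S shape the gradient.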
We found that we could improve the accuracy of the model by taking these additional training data (points H and E) into account, namely by formulating a general notion of model confidence across a batch of examples: how confident is the model that all the data it has seen can be classified correctly, or has it seen examples that do not fit its current understanding? Based on this notion, we added a loss that drives the model’s confidence towards 100% in all areas of the embedding space, which improved the precision and recall of our model.
The changes described above, in particular the variation, augmentation and combination of training data, allowed the neural network model used in Google Search to recognize sung or hummed tunes. The current system achieves a high level of accuracy with a database of over half a million songs that we constantly update. This collection of songs still has room to grow to include more music from around the world.
To try this feature, open the latest version of the Google app, tap the microphone icon and say “What’s this song?” or tap “Search a song”. Then you can hum or whistle a melody! We hope Hum to Search will help you get rid of earworms, or simply find and listen to a track without typing its name.