How AI Systems Aim to Simplify Sound Engineering

This weekend we look at two research projects from American universities that generate a reasonably believable soundtrack for silent video.

Photo: Free To Use Sounds / Unsplash

The difficult job of the Foley artist

Sounds for films and TV shows, such as the patter of rain, are very hard to record properly on set while a scene is being shot. There is too much extraneous noise, and the sound can clash with the actors' voices or with noise from equipment. For this reason, almost all sounds are recorded separately and mixed in during editing. That is the job of Foley artists, the sound-effects specialists…

If a movie needs the sound of a window breaking, the sound designers go into the studio and break glass under controlled acoustic conditions. They keep recording until the sound matches what is happening on screen. In particularly difficult cases this can take dozens of attempts, which makes filmmaking more complicated and more expensive.

Engineers at the University of Texas have proposed an alternative. They developed an AI system that recognizes what is happening in the frame and automatically suggests a soundtrack.

How it works

The engineers described how the system works in a paper published by the IEEE (PDF). They designed two machine learning models. The first extracts image features from the footage, for example color. The second analyzes how an object moves from frame to frame and determines the nature of that movement in order to select an appropriate sound.
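The split between per-frame image features and frame-to-frame motion can be illustrated with a toy sketch. This is not the authors' code: their models are machine learning models, while the hand-written features below (average color and mean pixel change) only show what kind of information each stream carries. The function names `mean_color`, `motion_energy` and `describe_clip`, and the representation of a frame as a nested list of RGB tuples, are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0..255
Frame = List[List[Pixel]]      # rows of pixels


def mean_color(frame: Frame) -> Tuple[float, ...]:
    """Appearance feature: average R, G and B over a single frame."""
    pixels = [p for row in frame for p in row]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def motion_energy(frames: List[Frame]) -> float:
    """Motion feature: mean absolute per-channel change between consecutive frames."""
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += sum(abs(a - b) for a, b in zip(p, c))
                count += 3
    return total / count


def describe_clip(frames: List[Frame]) -> Dict[str, object]:
    """Combine both feature streams into one description of the clip."""
    return {
        "mean_color": mean_color(frames[0]),
        "motion_energy": motion_energy(frames),
    }


# Hypothetical 2x2 clip: every pixel brightens by 10 between two frames.
dark = [[(10, 20, 30)] * 2 for _ in range(2)]
light = [[(20, 30, 40)] * 2 for _ in range(2)]
print(describe_clip([dark, light]))
```

In a real system a classifier would map features like these to a sound category; here they are only computed and printed.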

To build the soundtrack, the engineers developed a program called AutoFoley. It generates new sound from thousands of short audio samples: rain, a ticking clock, a galloping horse. The result is quite convincing.

Unfortunately, the system still has a number of serious limitations. It is suited to footage in which the sound does not have to match the video perfectly. Otherwise the desynchronization becomes noticeable, as in this video… In addition, the object must stay in the frame the whole time so that the ML model can recognize it. The developers are currently busy filing a patent, and they plan to fix these flaws afterwards.

Who else is working on such projects

In 2016, researchers from MIT and Stanford presented a machine learning model capable of adding sound to silent video. It predicts the sound from a property of an object in the frame, for example its material. As an experiment, the engineers fed the system a video in which a person strikes various surfaces with a drumstick: metal, earth, grass and others.

The developers assessed the algorithm's effectiveness with an online survey. The sounds of leaves and dirt came across as the most realistic (62% of respondents took them for real), while wood and metal were the least convincing. Metal sounded natural only 18% of the time.
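The percentages above are simple shares of survey answers. The sketch below shows the arithmetic. The vote counts are hypothetical, chosen only so that they reproduce the 62% and 18% figures quoted in the text; the source does not give the real number of respondents. The function name `realism_rate` is likewise an assumption.

```python
def realism_rate(judged_real: int, total: int) -> float:
    """Percentage of answers in which a generated sound was taken for a real one."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * judged_real / total


# Hypothetical counts (judged real, total answers), NOT taken from the study.
votes = {
    "leaves and dirt": (62, 100),
    "metal": (18, 100),
}

rates = {material: realism_rate(real, total) for material, (real, total) in votes.items()}

# Rank materials from most to least convincing.
ranked = sorted(rates, key=rates.get, reverse=True)
for material in ranked:
    print(f"{material}: {rates[material]:.0f}% judged real")
```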

This system also needs improvement. It generates the sounds made when objects collide, but it cannot recreate ambient sound such as wind noise. In addition, the algorithm fails when objects move too fast. Even so, solutions like these have potential: they could simplify the work of Foley artists and transform the film industry.

Additional reading on Hi-Fi World:

Moviegoer Horror: Remastered and Dubbed
Who chooses music for movies and TV shows? Music Supervisor
“Oh no, again”: music in movies and TV shows that we hear too often
Rain, clanking armor and liquid metal: how sound is created for cinema
“Sound shop”: How to create sound design for cinema