Photo: Erik-Jan Leusink / Unsplash
In December 2019, researchers from the Technical University of Munich and the Max Planck Institute for Informatics published a paper on a system called Neural Voice Puppetry.
To generate a video, the system needs only an audio file of a person's voice and a photo of that person. The process consists of three stages. First, a recurrent neural network analyzes the recorded speech and builds a logit model reflecting the speaker's pronunciation. That model is fed to a generalization network, which computes the coefficients for building a three-dimensional model of the face. Finally, a rendering module takes over and generates the output video.
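The three stages above can be sketched as a simple pipeline. This is a minimal illustration with stubbed-out models, not the authors' actual code; every function name and data shape here is an assumption made for readability.

```python
# Illustrative sketch of the three-stage pipeline (all names are
# assumptions, the real stages are learned neural networks).

def extract_logits(audio_frames):
    """Stage 1: a recurrent network would map each audio frame to a
    vector of pronunciation logits; stubbed with fixed values here."""
    return [[0.1 * i] * 4 for i in range(len(audio_frames))]

def predict_face_coefficients(logits):
    """Stage 2: a generalization network would map logits to
    coefficients of a 3D face model; stubbed with a simple average."""
    return [[sum(frame) / len(frame)] * 3 for frame in logits]

def render_frames(coefficients, photo):
    """Stage 3: a rendering module would produce one video frame per
    coefficient set, conditioned on the target person's photo."""
    return [{"photo": photo, "coeffs": c} for c in coefficients]

def voice_puppetry(audio_frames, photo):
    logits = extract_logits(audio_frames)
    coeffs = predict_face_coefficients(logits)
    return render_frames(coeffs, photo)
```

The key structural point is that audio never drives pixels directly: it is first distilled into pronunciation logits, then into face-model coefficients, and only then rendered.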
The developers say that Neural Voice Puppetry produces high-quality video, but they still need to solve some problems with audio synchronization.
Engineers at Nanyang Technological University in Singapore are developing similar technology. Their system combines a speech recording of one person with a video of another. First, it builds a 3D model of the face for each frame of the target video. Then a neural network analyzes key facial landmarks and modifies the three-dimensional model so that its expressions match the phonemes of the source audio file. According to the authors, their tool surpasses its analogues in quality: in blind tests, respondents rated 55% of the generated recordings as "real."
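The retargeting step can be illustrated with a toy phoneme-to-mouth mapping. This is purely an assumed sketch (the viseme table and parameter names are invented for illustration; the real system learns this mapping), but it shows the idea of overwriting per-frame face-model parameters with values implied by the aligned audio.

```python
# Assumed toy viseme table: phoneme symbol -> mouth-opening amount.
# ARPAbet-style symbols are used only as an example.
PHONEME_TO_MOUTH_OPEN = {
    "AA": 0.9, "IY": 0.4, "UW": 0.3, "M": 0.0, "sil": 0.1,
}

def retarget(face_models, phonemes):
    """For each video frame's 3D face model, replace the mouth
    parameter with the value implied by the time-aligned phoneme
    from the source audio."""
    out = []
    for model, ph in zip(face_models, phonemes):
        model = dict(model)  # copy, keep the original frame intact
        model["mouth_open"] = PHONEME_TO_MOUTH_OPEN.get(ph, 0.1)
        out.append(model)
    return out
```

In the actual system, a network would do this adjustment over many more parameters than mouth opening, and smooth them over time, but the frame-by-frame structure is the same.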
Where the technology can be applied
In the future, deepfakes will make it possible to create realistic video avatars – visual personalities for voice assistants. In 2017, enthusiast Jarem Archer implemented the Cortana assistant from Windows 10 as a hologram. AI systems for generating deepfakes will take such solutions to a new level. Another application area for these algorithms is the gaming industry: generating facial animation from an audio track will simplify the work of game designers who fine-tune the facial expressions of virtual characters.
Developers of deepfake technology note that their systems are just a tool – and, unfortunately, one that will inevitably be used for illegal purposes. The first such crime was committed in 2019: the director of a British energy company transferred $240,000 to a fraudster who used neural networks to imitate the voice of the head of the firm's German parent company and asked him to complete the transaction. That is why university experts are actively working with law enforcement agencies and politicians to prevent such situations. For example, the University of Colorado Denver is developing tools for recognizing fake audio and video recordings. There will only be more such projects in the future.
What other projects are there
There are tools that make editing audio recordings as easy as editing plain text. For example, Descript offers an audio editor that transcribes the speaker's words and lets you edit them as text. You can add pauses or rearrange fragments, and all edits are synchronized with the audio recording. The developers say the system processes .m4a, .mp3, .aiff, .aac and .wav files, and that transcription accuracy exceeds 93%.
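The core idea behind text-based audio editing can be sketched with a simple data model (assumed here for illustration, not Descript's actual API): each transcribed word carries the time span it occupies in the recording, so deleting or reordering words translates directly into cut-and-splice operations on the audio.

```python
# Minimal sketch of text-driven audio editing. The Word data model
# and edit_to_segments helper are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # position in the recording, seconds
    end: float

def edit_to_segments(words, kept_indices):
    """Translate a text edit (which words to keep, in what order)
    into the list of audio segments to splice together."""
    return [(words[i].start, words[i].end) for i in kept_indices]

transcript = [Word("hello", 0.0, 0.4),
              Word("there", 0.5, 0.9),
              Word("world", 1.0, 1.5)]

# Delete "there" and swap the remaining words in the text editor:
segments = edit_to_segments(transcript, [2, 0])
```

An audio engine would then concatenate those segments (with crossfades at the cut points) to produce the edited recording.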
Photo: Yohann Libot / Unsplash
Other projects appeared around the same time as Descript. Engineers at Princeton University presented a "Photoshop for audio" – the VoCo system. It allows not only editing recordings as text, but also synthesizing new phrases in the speaker's voice, intonation included.
In the future, such services will be useful to reporters and media companies that create audio content. They will also help people with certain medical conditions who communicate using speech synthesis systems: VoCo and its counterparts will make their voices sound less "robotic."
Additional reading in our blog “World of Hi-Fi”:
Bitching Betty and audio interfaces: why they speak in a female voice
Audio Interfaces: Sound as a source of information on the road, in the office and in the sky
The world’s first “gender-neutral” voice assistant
The history of speech synthesizers: the first mechanical installations
How speech synthesis appeared on PC