How I created my presentation deepfake

My open-source project Wunjo AI recently came in handy for creating my own deepfake – English speech synthesis with a minimal accent. In this article, I want to tell how I achieved this and why I did it, and demonstrate the result. You will learn how deepfakes can enrich the content creation process. Then you can decide for yourself whether deepfakes and speech synthesis can be useful to you.

What is the essence of the problem?

I was preparing a video presentation of a research paper on Study N for a conference. In the process, I ran into a number of difficulties, which I will discuss below.

Problem 1: Speech and sound

When I recorded my voice, the sound quality was poor and my accent was noticeable, which made the speech hard to understand. The sound can be improved with processing, but mispronounced words are unlikely to be fixable (although there is an idea about this below). So the first step was to synthesize the speech from text with a voice synthesizer that I had trained myself. If you are interested in training your own model on your voice, or on another English voice without an accent, there is a video instruction for that. I integrated my model into Wunjo AI and spent about 15-20 minutes turning the entire text of the presentation into synthesized speech. Training the model took about a day, but since it had been trained in advance, I did not need to repeat that step. Recording and processing my own voice would have taken much more time than speech synthesis.
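
For illustration, here is a minimal sketch of that batch-synthesis step using the open-source Coqui TTS library rather than Wunjo AI's interface; the model and file paths are assumptions, and a custom-trained checkpoint like mine would be loaded the same way.

```python
# A sketch of the batch synthesis step, assuming the open-source Coqui TTS
# library instead of Wunjo AI's interface. The paths are hypothetical; a
# custom-trained checkpoint is loaded from its weights and config files.
from pathlib import Path

from TTS.api import TTS

# Load a custom-trained voice model (hypothetical paths).
tts = TTS(model_path="my_voice/model.pth", config_path="my_voice/config.json")

script = Path("presentation.txt").read_text(encoding="utf-8")

# One WAV per paragraph, so a single segment can be re-synthesized later.
for i, paragraph in enumerate(p for p in script.split("\n\n") if p.strip()):
    tts.tts_to_file(text=paragraph, file_path=f"speech_{i:02d}.wav")
```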

Comparison results

Comparing my voice with the synthesized speech, the difference is like night and day. In hindsight, it would have been worth cloning my voice from a small fragment of the processed recording: the synthesized speech would then keep the sound of my voice, but without the accent, and the encoder, synthesizer, and vocoder for English are already publicly available. I liked the idea so much that I added voice cloning to Wunjo AI this weekend. You can expect it in version 1.5, or use the v1.5 branch on GitHub right away.
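
As a sketch of what such cloning looks like with publicly available components, here is an example using the open-source Coqui XTTS v2 model (not Wunjo AI's own pipeline; file names are placeholders):

```python
# A sketch of voice cloning from a short reference clip with the publicly
# available Coqui XTTS v2 model (not Wunjo AI's own pipeline; the file
# names are examples). A few seconds of clean processed speech suffice.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Welcome to my presentation.",
    speaker_wav="processed_voice_fragment.wav",  # small cleaned-up fragment
    language="en",
    file_path="cloned_speech.wav",
)
```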

Problem 2: Human video

I decided to create a video with a talking person in the frame, because I like presentations where you can see the speaker, not just slides with text. Shooting a quality video of a person takes real effort: the right lighting, the right atmosphere, and a face that knows how to behave in the frame. Unfortunately, I had none of that. I decided not to film myself (it would have looked something like here at 37 seconds... well, not really), and instead looked for suitable footage on YouTube, for example a review video by Chris Tomshack, whose face shape is similar to mine, since a mismatch in face shape and size is noticeable in a deepfake.

Deepfake with a mismatch between face shape and size

Video processing

Since I used YouTube reviews of technological innovations as the base footage, the talking person often appears in the frame for less time than the subject of the review. I selected the longest segments with the person in them, reversed the video, and appended the reversed copy to the original to make a seamless loop. I then lip-synced the face to the synthesized speech with Wunjo AI; since the speech was longer than the segment itself, the video simply looped, and thanks to the reversal the frame transition was not noticeable. This made it possible to achieve a better result.
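
For reference, the reverse-and-append loop can be reproduced with a single ffmpeg filter graph; this is a sketch with example file names, called from Python:

```python
# A sketch of the reverse-and-append loop with ffmpeg, run from Python.
# File names are examples. The `reverse` filter buffers the whole clip in
# memory, so it is only practical for short segments like this one.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "person_segment.mp4",
    # Split the video stream, reverse one copy, and concatenate both so the
    # clip plays forward and then backward with no visible jump at the seam.
    "-filter_complex",
    "[0:v]split[fwd][tmp];[tmp]reverse[rev];[fwd][rev]concat=n=2:v=1:a=0[v]",
    "-map", "[v]", "-an",  # drop audio; synthesized speech is added later
    "looped_segment.mp4",
], check=True)
```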

Deepfake lip movement

Correction of defects

Sometimes there were problems such as an incorrect lip position or, for example, a double chin at the cut. I split the video into frames using ffmpeg and retouched the defective frames by hand. This stage took a little time: I manually fixed roughly 40-60 frames.
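
Here is a sketch of that split-and-reassemble round trip with ffmpeg (the frame rate and file names are assumptions; match them to your source video):

```python
# A sketch of the split-retouch-reassemble round trip with ffmpeg. The
# frame rate and file names are assumptions; match them to the source video.
import subprocess
from pathlib import Path

Path("frames").mkdir(exist_ok=True)

# Extract every frame as a numbered PNG for manual retouching.
subprocess.run(["ffmpeg", "-i", "lipsync.mp4", "frames/%05d.png"], check=True)

# ...retouch the defective frames (lips, double chin) in an image editor...

# Rebuild the video from the edited frames at the original frame rate.
subprocess.run([
    "ffmpeg", "-framerate", "30", "-i", "frames/%05d.png",
    "-c:v", "libx264", "-pix_fmt", "yuv420p", "retouched.mp4",
], check=True)
```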

Video face replacement

I created my deepfake from just one photo of my face. I took the photo with my front camera and, in Wunjo AI 1.5, overlaid it on the synthesized video, then enhanced the face to HD quality. The result looks good to me: if you don't know it is a deepfake, it is hard to notice.
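
As a standalone illustration of a single-photo face swap on one frame, here is a sketch using the open-source insightface library, which implements a similar technique to the one Wunjo AI wraps; the model file and paths are assumptions:

```python
# A sketch of a single-photo face swap on one video frame with the
# open-source insightface library, similar in spirit to what Wunjo AI
# wraps. File names are examples, and the inswapper ONNX model has to be
# obtained separately.
import cv2
import insightface
from insightface.app import FaceAnalysis

detector = FaceAnalysis(name="buffalo_l")      # face detection + embeddings
detector.prepare(ctx_id=0, det_size=(640, 640))
swapper = insightface.model_zoo.get_model("inswapper_128.onnx")

source = cv2.imread("front_camera_photo.jpg")  # the single photo of my face
frame = cv2.imread("frames/00001.png")         # one frame of the base video

source_face = detector.get(source)[0]
for target_face in detector.get(frame):
    frame = swapper.get(frame, target_face, source_face, paste_back=True)

cv2.imwrite("frame_swapped.png", frame)
```

Running this over every frame and then passing the result through a face-restoration model gives the kind of HD enhancement mentioned above.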

Deepfake face swap from photo

What do we have?

The final speech and deepfake video took me about 2 hours, plus 30 minutes to combine the slides with the video of the speaking person. For a longer video, a deepfake of emotions could also be added to some speech segments if desired.

I hope this article has shown you how deepfakes and speech synthesis can be useful for content creation, presentations, and work. After the release of version 1.5, I plan to record a video tutorial on my YouTube channel covering all the deepfake features in the app, where everything described above will be available (except the manual retouching). See you soon!
