How will we live in a world where everything can be fake? A first-person account of getting a digital double

Hello, this is Sherpa Robotics. Today we have translated for you an article by Melissa Heikkilä, a senior reporter at MIT Technology Review who covers artificial intelligence and how it is changing our society. To write this piece, Melissa went through the process of creating a digital twin with the help of a startup called Synthesia.

Synthesia's new technology is impressive, but it raises serious questions about a world in which it is increasingly difficult to distinguish reality from fiction. The AI startup has created a hyper-realistic deepfake of the article's author, so lifelike it's frightening.

In the article, Melissa discusses how to distinguish reality from fiction in the era of synthetic media and how this will affect our lives in the future.

I'm nervous and late. After all, what do you wear for eternity? It sounds as if I'm dying, but it's the other way around: in a sense, I'm going to live forever, thanks to the AI video startup Synthesia. The company has been creating AI-generated avatars for several years, and in April it introduced an update, the first to use the latest advances in generative AI. The new version is more realistic and expressive than anything I've seen before. Since the new release will let virtually anyone create a digital double of themselves, the company agreed to make a digital version of me in early April, before the technology became available to the public.

When I finally arrived at the company's stylish studio in east London, I was greeted by Tosin Oshinemi, head of production. He will guide me through the data collection process – and by “data collection” I mean capturing my facial features, mannerisms, and so on — just as he usually does for Synthesia actors and clients. He introduces me to the stylist and makeup artist, and I kick myself for spending so much time getting ready. Their job is to make sure that people are dressed to look good on camera and that their appearance is consistent from shot to shot. The stylist says my outfit is fine (thank God), and the makeup artist touches up my makeup.

The dressing room is decorated with hundreds of smiling photos of people who were “cloned” using this technology before me. With the exception of a small supercomputer whirring in the hallway, crunching data generated in the studio, the whole experience is more like walking into a news studio than visiting a doppelgänger factory.

I joke with Oshinemi that MIT Technology Review might describe his job title as “director of deepfakes.”


“We prefer the term 'synthetic media' rather than 'deepfake,'” he says.

This is a subtle but, some would argue, significant semantic difference. Both terms refer to AI-generated videos or audio recordings of people doing or saying something that did not necessarily happen in real life. But deepfakes have a bad reputation. Since the term emerged almost ten years ago, it has come to signal something unethical, says Alexandru Voica of Synthesia.

Synthesia, a startup working on creating avatars using artificial intelligence, says its new technology can create images of people so realistic that they are virtually indistinguishable from the real thing. But how ethical and safe is this technology?

Thanks to rapid advances in generative AI and an abundance of training data created by actors and fed to its model, Synthesia has been able to create avatars that are more realistic and expressive than their predecessors. The digital clones are better at matching their reactions and tone to the mood of the script, sounding more upbeat when talking about happy things and turning more serious and sad when the subject is unpleasant. They are also better at reproducing facial expressions — the subtle movements that can speak for us without words.

However, this technological advance also signals a much larger social and cultural shift. More and more of what we see on our screens is generated (or at least edited) by AI, and it is becoming increasingly difficult to distinguish reality from fiction. This undermines our trust in everything we see, and this can have very real and dangerous consequences.

“I think we're just going to have to say goodbye to getting truthful information quickly,” says Sandra Wachter, a professor at the Oxford Internet Institute who studies the legal and ethical implications of AI. “The idea that you can just quickly Google something and find out what is fact and what is fiction is fundamentally wrong.”

So while I was excited that Synthesia would make a digital double of me, I also wondered whether the distinction between synthetic media and deepfakes really matters. Even if the former emphasizes the intent of the creator and, importantly, the consent of the subject, is there really a way to make AI avatars safe if the end result is the same? And do we really want to leave the uncanny valley behind if it means we can no longer tell what is real and what is an avatar?


A month before the studio trip, I visited Victor Riparbelli, Synthesia's CEO, at his office near Oxford Circus. According to Riparbelli, his path to Synthesia began with a fascination with avant-garde, geeky techno music while growing up in Denmark. The internet allowed him to download software and create his own songs without having to buy expensive synthesizers. “I think it's right to give people the opportunity to express themselves the way they want; we deserve that kind of world,” he says. He saw an opportunity to do something similar with video when he came across research on using deep learning to transfer facial expressions from one person to another on screen.

Synthesia, a company that specializes in creating avatars using artificial intelligence, has made significant progress by creating videos that are virtually indistinguishable from reality. But what does this mean for the future of content and the way we consume information?

Synthesia is a European startup that managed to attract investment and reach a valuation of more than $1 billion, making it one of the few European AI startups to achieve such success. The first generation of Synthesia avatars was clunky, with repetitive movements and little variety. Subsequent versions became more human, but they still stumbled over complex words, and their movements were sometimes out of sync.

The problem is that people are used to looking at other people's faces. “We know what real people look like,” says Jon Starck, Synthesia's CTO. Since childhood, we have been “tuned to people and their faces. Even the slightest inaccuracy is noticeable.”

These early AI-generated videos, like deepfakes in general, were created using generative adversarial networks, or GANs, an older image- and video-generation technique that pits two neural networks against each other. The process was labor-intensive and complex, and the technology was unstable.
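
To make the “two neural networks playing against each other” idea concrete, here is a minimal, hypothetical training-step sketch in PyTorch. The toy data, layer sizes, and hyperparameters are all invented; it illustrates the adversarial setup in general, not Synthesia's actual models.

```python
# Minimal GAN sketch (illustrative only, not Synthesia's code).
# A generator learns to turn random noise into samples; a discriminator
# learns to tell generated samples from real ones. Training them against
# each other is what "two networks playing against each other" means.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64  # made-up toy sizes
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_batch = torch.randn(32, data_dim)  # stand-in for real training images

for step in range(100):
    # Discriminator step: push real samples toward label 1, fakes toward 0.
    z = torch.randn(32, latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    z = torch.randn(32, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```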

But during last year's generative AI boom, the company found it could create better avatars by using generative models that produce higher-quality images more consistently. The more data these models are fed, the better they learn. Synthesia uses both large language models and diffusion models: the former help the avatars react to the script, and the latter generate the pixels.
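
As a rough mental model of that split between the two kinds of models, here is a deliberately toy sketch: a language-model-like stage turns the script into per-sentence expression cues, and a diffusion-style stage starts from noise and iteratively refines a frame conditioned on each cue. Every function name, mood label, and update rule here is invented; it only illustrates the division of labor described above, not Synthesia's actual pipeline.

```python
# Hypothetical "plan the performance, then paint the pixels" sketch.
# All names and the denoising rule are stand-ins, not a real system's API.
import numpy as np

def plan_expressions(script: str) -> list[dict]:
    """Stand-in for the language-model stage: map each sentence of the
    script to an expression cue the renderer can condition on."""
    cues = []
    for sentence in script.split("."):
        if not sentence.strip():
            continue
        mood = "upbeat" if "happy" in sentence.lower() else "neutral"
        cues.append({"text": sentence.strip(), "mood": mood})
    return cues

def render_frame(cue: dict, steps: int = 50, size: int = 64) -> np.ndarray:
    """Stand-in for the diffusion stage: start from pure noise and
    repeatedly refine it toward an image conditioned on the cue.
    Real diffusion models use a learned denoising network; this toy
    update rule only mimics the iterative refinement."""
    rng = np.random.default_rng(0)
    frame = rng.normal(size=(size, size, 3))           # pure noise
    target = 0.8 if cue["mood"] == "upbeat" else 0.3   # toy "conditioning"
    for _ in range(steps):
        frame = frame + 0.1 * (target - frame)         # drift toward the target
    return np.clip(frame, 0.0, 1.0)

script = "I am so happy to see you. Now for the quarterly numbers."
frames = [render_frame(cue) for cue in plan_expressions(script)]
print(len(frames), frames[0].shape)
```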

Despite the leap in quality, the company still does not position itself as a player in the entertainment market. Synthesia is betting that as people spend more time watching videos on YouTube and TikTok, demand for video content will increase. Young people are already skipping traditional search and turning to TikTok for information presented in video form. Riparbelli argues that Synthesia's technology can help companies turn boring corporate communications, reports and training materials into content that people will actually watch and interact with.

He claims that Synthesia technology is used by 56% of Fortune 100 companies, with the vast majority of them using it for internal communications. The company lists Zoom, Xerox, Microsoft and Reuters as clients. The cost of services starts from $22 per month. The company hopes it will be a cheaper, more effective alternative to professional video—and one that may be virtually indistinguishable from it. Riparbelli says his newest avatars could easily fool a person into thinking they are real.

“I think we’re 98% there,” he says.

Synthesia is committed to creating AI avatars with people's consent, but the technology is still vulnerable to abuse. The company is introducing measures to combat misinformation and misuse of its technology.


The process of creating AI avatars at Synthesia is different from how most other avatars, deepfakes, or synthetic media — whatever you want to call them — are made.

Most deepfakes are not created in a studio. Research shows that the vast majority of deepfakes online are non-consensual sexual content, typically using images stolen from social media. Generative AI has made creating such deepfakes easy and cheap, and there have been several high-profile cases of children and women being victims of such abuse in the US and Europe. Experts also warn that the technology can be used to spread political disinformation, which is especially relevant in light of the record number of elections taking place around the world this year.

Synthesia's policy is that the company does not create avatars of people without their explicit consent. However, it is not immune to abuse. Last year, researchers discovered pro-China disinformation created using Synthesia avatars and presented as news, which the company said violated its terms of service.

Since then, the company has implemented more stringent verification and content moderation systems. It also embeds a watermark containing information about where and how its AI-avatar videos were created.
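
The article doesn't describe Synthesia's watermark format, but as a rough illustration of what provenance metadata can capture, here is a hypothetical sketch of a sidecar manifest with a content hash. Real systems (for example, C2PA-style content credentials) are far richer and cryptographically signed; this only shows the general idea of recording where and how a file was made and being able to check a copy against that record later.

```python
# Hypothetical provenance record for a generated video file (not Synthesia's format).
import hashlib
import json
from datetime import datetime, timezone

def provenance_manifest(video_path: str, tool: str, operator: str) -> dict:
    """Record where and how a file was produced, plus a fingerprint of it."""
    with open(video_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "sha256": digest,        # fingerprint of this exact file
        "generator": tool,       # which system produced it
        "operator": operator,    # which account requested it
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "ai_generated": True,
    }

def matches(video_path: str, manifest: dict) -> bool:
    """Re-hash the file and compare it with the recorded fingerprint."""
    with open(video_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == manifest["sha256"]

if __name__ == "__main__":
    # Placeholder file standing in for a rendered avatar clip.
    with open("avatar_clip.bin", "wb") as f:
        f.write(b"not a real video, just demo bytes")
    record = provenance_manifest("avatar_clip.bin", tool="avatar-renderer", operator="client-123")
    print(json.dumps(record, indent=2))
    print("unchanged:", matches("avatar_clip.bin", record))
```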

There's a saying in AI research: “Garbage in, garbage out.” If the data used to train an AI model is garbage, it will show in the model's output. The more data points the model has captured about facial movements, microexpressions, head tilts, blinks, shrugs, and hand waves, the more realistic the avatar will be.

When I'm in the studio, I try my best. I stand in front of a green screen, and Oshinemi guides me through the initial calibration process, in which I have to move my head and then my eyes in a circular motion. Apparently this allows the system to understand my natural colors and facial features. I am then asked to say the phrase “All the boys ate fish,” which captures all the mouth movements needed to form vowels and consonants. We also shoot footage of me simply staying silent. He then asks me to read a script for a fictional YouTube channel in different tones, guiding me through the range of emotions I should convey: first neutrally and informatively, then encouragingly, then annoyed and complaining, and finally excited and persuasive. We shoot several takes with different variations of the script. In some versions I am allowed to move my arms; in others, Oshinemi asks me to hold a metal pin between my fingers while I speak.

Historically, getting AI avatars to look natural and match lip movements to speech has been a huge challenge, says David Barber, a professor of machine learning at University College London who is not involved in the Synthesia work. The point is that the task goes far beyond lip movements; you need to think about the eyebrows, all the facial muscles, the shrug, and the many small movements people use to express themselves.

Synthesia has been working with actors to train its models since 2020, and their doppelgängers make up the 225 stock avatars available for clients to animate with their own scripts. But to train its latest generation of avatars, Synthesia needed more data; over the past year it has worked with roughly 1,000 professional actors in London and New York. (Synthesia says it doesn't sell the data it collects, although it does release some of it for academic research.)

Synthesia makes efforts to prevent abuse of its technology by implementing strict rules and content moderation systems. But how effective are these measures in combating disinformation?

Synthesia, in an effort to prevent misuse of its AI avatars, has implemented strict rules and content moderation systems. Where it once had four employees dedicated to content moderation, 10% of the company's 300 employees now do that job. The company has also hired an engineer to build better AI-based content moderation systems.

These filters help Synthesia check everything its clients try to generate. Anything suspicious or controversial, such as content about cryptocurrencies or sexual health, is referred to human content moderators for review. Synthesia also keeps records of all the videos its system produces.
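
As a toy illustration of that kind of routing logic (the keyword lists, categories, and logging format here are invented, not Synthesia's actual filters), a first-pass moderation layer might look something like this:

```python
# Toy content-routing sketch: anything touching sensitive topics goes to a
# human reviewer; everything else passes, and every request is logged.
# Keyword lists and categories are invented for illustration only.
SENSITIVE_TOPICS = {
    "crypto": ["bitcoin", "token sale", "crypto"],
    "sexual_health": ["sexual health"],
}

audit_log: list[dict] = []

def route(script: str, user_id: str) -> str:
    lowered = script.lower()
    flags = [topic for topic, words in SENSITIVE_TOPICS.items()
             if any(w in lowered for w in words)]
    decision = "human_review" if flags else "auto_approve"
    audit_log.append({"user": user_id, "flags": flags, "decision": decision})
    return decision

print(route("Our new bitcoin fund beats the market", "client-42"))    # -> human_review
print(route("Welcome to the quarterly onboarding video", "client-7")) # -> auto_approve
```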

And while anyone can join the platform, many features are not available until users pass an extensive vetting process similar to that used in banking, which includes speaking with the sales team and signing legal contracts, Voica says. Entry-level users can only create factual content, while only enterprise customers using custom avatars can create content containing personal opinions. Additionally, only accredited news organizations can produce content on current affairs.

“We can't claim to be perfect. If people report things to us, we respond quickly, [for example, by banning or restricting access for] individuals or organizations,” says Voica. But he believes these measures act as a deterrent, which is why most bad actors turn to freely available open-source tools instead.

I tested some of these limitations when I went to the Synthesia office for the next step in creating my avatar. To create a video with my avatar, I have to write a script. Using Voica's account, I decided to use excerpts from Hamlet. I also tried to get my avatar to read the news about the new European Union sanctions against Iran. Voica immediately texted me: “You've gotten me into trouble!”

The system blocked his account for attempting to create content that is prohibited. Offering services without these restrictions would be “a great growth strategy,” Riparbelli grumbles. But “at the end of the day, we have very strict rules about what you can create and what you can't create. We believe that the right approach to introducing these technologies into society is to be strict.”

However, even if these restrictions work perfectly, the Internet will still end up being a place where everything is fake. And my experiment makes me wonder how we can prepare for this. Our information landscape already seems very murky. On the one hand, there is increased public awareness that AI-generated content is thriving and can be a powerful tool for disinformation. But on the other hand, it is still unclear whether deepfakes are used for disinformation on a mass scale and whether they influence changes in people's beliefs and behavior.

If people become too skeptical of what they see, they may stop believing anything at all, which could allow bad actors to exploit this trust vacuum and lie about the authenticity of real content. Researchers call this the “liar's dividend.” They warn that politicians, for example, could claim that genuinely incriminating information is fake or AI-generated.

Claire Leibowicz, head of AI and media integrity at the nonprofit Partnership on AI, says she worries that growing awareness of AI's capabilities will make it easier to “plausibly deny and sow doubt about real material or media as evidence in various contexts, not only in the news, [but] also in the courts, in the financial industry and in many of our institutions.” She says she is encouraged by the resources Synthesia devotes to content moderation and consent, but adds that the process is never flawless. Even Riparbelli admits that in the short term, the spread of AI-generated content will likely cause problems.


Synthesia's avatar of me is a remarkably accurate, but still unnatural, copy. It makes you think about the nature of identity in the digital age and what a future where more and more content is generated by artificial intelligence will look like.

When I saw the first video with my avatar, I got a strange feeling. It was like the discomfort of seeing yourself on video or hearing a recording of your own voice. At first the avatar seemed to be me. But the more I watched videos of “myself,” the more questions I had. Do I really squint that much? Do I blink that many times? Do I move my jaw like that? My God.

The avatar was good. Really good. But not perfect. “Strange but good animation,” my partner texted me. “But the voice sometimes sounds exactly like you, and sometimes it sounds artificial and has a strange tone,” he added.

He's right. The voice is sometimes mine, but in real life I say “um” and “ahh” far more. What's remarkable is that the avatar picked up the unevenness of my speech. My accent is a jumbled transatlantic mix, muddled by years of living in the UK, watching American television, and attending an international school. My avatar sometimes says the word “robot” with a British accent and sometimes with an American one. Probably no one else would notice. But the AI noticed.

This isn't the first time I've made myself a test subject for new AI. Not long ago I tried generating AI avatar images of myself and ended up with a bunch of nude photos. That experience was a prime example of how biased AI systems can be. But this experience — and this particular way of being memorialized — was definitely on another level. Carl Öhman, an associate professor at Uppsala University who studies digital remains and is the author of the new book The Afterlife of Data, calls avatars like the one I created “digital corpses.”

“It looks just like you, but no one is home,” he says. “It would be like cloning you, but your clone is dead. And then you animate the corpse so it moves and speaks, using electrical impulses.”

This is exactly what it feels like. The small, nuanced moments where I don't recognize myself are enough to put me off. On the other hand, the avatar could easily fool anyone who doesn't know me very well. And while it's no Shakespeare, it's better than many corporate presentations I've sat through. I think that if I used it to deliver an annual report to my colleagues, this level of authenticity might be enough.

That's the point, says Riparbelli: “What we do is more like PowerPoint than Hollywood.” The newest generation of avatars is definitely not ready for the big screen. For now they're stuck in portrait mode, showing the avatar only from the front and from the waist up. But in the near future, Riparbelli says, the company hopes to create avatars that can gesture with their hands and hold conversations with each other. It also plans full-body avatars that can walk and move around a space a person has generated. But is this what we really want? It looks like a dystopian future in which people consume AI-generated content presented to them by AI-generated avatars, and use AI to repackage that content into new content that will likely be used to create even more AI.

The Synthesia Avatar experience underscores the importance of strengthening content moderation efforts and ensuring trustworthy information in the digital environment. What awaits us in the future, where AI avatars become more and more realistic?

My experiment with the Synthesia avatar clearly demonstrated that the tech sector urgently needs to improve content moderation practices and ensure reliable methods of verifying the origin of content, such as watermarks.

Even if Synthesia's technology and content moderation are not yet perfect, they are vastly superior to anything I've seen in the field before, and this is only a year or two into the current generative AI boom. AI development is moving at breakneck speed, and it's both exciting and scary to imagine what AI avatars will look like in just a few years. In the future, we may need to use code words to signal that we are actually communicating with a real person and not an AI.

But that day has not come yet.

For the most persistent readers who made it to the end of the article:

In the AI-generated video, the synthetic “Melissa” performs Hamlet's famous soliloquy: watch video
