
Forging Voices and Faces: The Dangers of Audio and Video Fabrication

Adobe, Baidu, Google, and others have software that can fabricate convincing video or audio clips of anyone

Photo: Jonathan Knowles/Getty Images

In 1963, before he could give the speech he’d prepared for his trip to Dallas, U.S. president John F. Kennedy was assassinated. In March 2018, a company re-created the speech that Kennedy had intended to give, synthesized from fragments of his own voice.

Technology companies including Google, Baidu, and Adobe have recently funded efforts to fabricate audio or video from samples of speech or fragments of footage. Startups including Voicery and Lyrebird have developed customizable human voices (built from audio recorded by professional voice actors) that can be programmed to say anything. These companies have also released do-it-yourself software that lets you synthesize your own voice (or someone else’s, with their permission) from a 1-minute recording. And open-source tools to build such programs are available on GitHub.

The results are now convincing enough to raise concerns that these tools could fall into the wrong hands. “It’s not unreasonable to think that you could fool a large group of people with the technology in the state that it’s in today,” says Michael Fauscette, chief research officer at the software review site G2 Crowd.

Someone could use a synthesized voice to deceive devices trained to recognize an individual’s “voiceprint,” or generate a fake video to use for blackmail. Fabricating statements by world leaders or publishing fake videos of CEOs could create problems much faster than those clips could be debunked.

To synthesize audio or video, experts have primarily focused on two techniques, both of which rely on machine learning: text-to-speech (generating humanlike speech from annotated voice recordings), and style transfer (in which the style of a piece of content, such as the 1889 painting The Starry Night by Vincent van Gogh, is imposed onto photos or videos).
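As a rough illustration of the style-transfer half of that description, here is a minimal sketch (assuming PyTorch, with random tensors standing in for feature maps that a real system would extract from a pretrained CNN such as VGG) of the Gram-matrix style loss such systems typically optimize:

```python
# Minimal style-transfer sketch: "style" is captured as correlations (Gram
# matrices) between a network's feature maps, and a generated image is
# optimized to match them. Random tensors stand in for real features here.
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlations of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.view(c, h * w)
    return flat @ flat.t() / (c * h * w)

# Stand-ins for features of a style image (e.g. The Starry Night) and of the
# image being generated.
style_features = torch.randn(64, 32, 32)
generated = torch.randn(64, 32, 32, requires_grad=True)

target_gram = gram_matrix(style_features).detach()
optimizer = torch.optim.Adam([generated], lr=0.05)

for step in range(100):
    optimizer.zero_grad()
    loss = F.mse_loss(gram_matrix(generated), target_gram)  # style loss
    loss.backward()
    optimizer.step()

print(f"final style loss: {loss.item():.6f}")
```

A full system adds a content loss so the output keeps the structure of the original photo while taking on the borrowed style.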

Mikel Rodriguez, a machine-learning researcher at the Mitre Corp., says the algorithms used to fabricate videos are a twist on techniques that have long been used to classify images. Such programs rely on convolutional neural networks, in which artificial neurons learn to use a numerical matrix, called a filter, to assign values to pixels in an image.

These programs have traditionally used those values to draw a conclusion about an image’s content—to say, for example, whether a photo shows a dog or doesn’t show a dog. In the new versions, Rodriguez explains, “Essentially, instead of saying, ‘Give me the answer,’ you’re saying, ‘Give me the pixel.’ ”
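To make that contrast concrete, here is a hedged sketch (assuming PyTorch; the layer sizes are arbitrary) of the same convolutional building blocks ending either in a single answer or in a grid of pixels:

```python
# Classification vs. generation with convolutional networks: one network
# collapses an image to a single label, the other expands a small code
# back into an image's worth of pixel values.
import torch
import torch.nn as nn

# Classifier: convolutional filters reduce the image to one answer.
classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1), nn.Sigmoid(),  # probability that the photo shows a dog
)

# Generator: transposed convolutions expand a latent code into pixels.
generator = nn.Sequential(
    nn.ConvTranspose2d(8, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),  # RGB values
)

image = torch.randn(1, 3, 64, 64)  # stand-in photo
code = torch.randn(1, 8, 16, 16)   # stand-in latent code

print(classifier(image).shape)  # torch.Size([1, 1])        -> "give me the answer"
print(generator(code).shape)    # torch.Size([1, 3, 64, 64]) -> "give me the pixel"
```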

Such systems are rapidly improving. In December 2017, Google researchers published a paper describing Tacotron 2, a text-to-speech (TTS) system based on neural networks that could generate speech that sounds as natural to listeners as recordings of real people. In February, Baidu described Deep Voice 3, a TTS system that can be trained much faster than the original version of Tacotron. A month later, Google published two more papers devoted to improving Tacotron’s ability to convey humanlike expressiveness—such as intonation, stress, and rhythm—to match the content of its synthesized speech.

Companies believe the tools needed to synthesize voices or videos could become valuable products. CereProc, the company that synthesized Kennedy’s speech, has created more than 100 custom voices for people who have lost their own due to illness. In a statement about its beta TTS project, called VoCo, Adobe said podcast producers or advertisers could use it to make last-minute edits to a show or voice-over.

To avoid misunderstandings, creators could embed a digital watermark into any synthesized media they produce. But there’s no guarantee that everyone will follow the same rules. And there’s no good way to independently tell whether a video or audio recording has been falsified. “Right now, there’s no tool that works all the time,” Rodriguez says.
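The article doesn’t describe a particular watermarking scheme, but a toy example of the idea—sketched here as an assumption, using a hypothetical least-significant-bit tag in 16-bit audio samples, far simpler than any production watermark—might look like this:

```python
# Toy audio watermark: hide a short tag in the least significant bits of
# 16-bit PCM samples so downstream tools can flag the clip as synthetic.
import numpy as np

TAG = b"SYNTH"  # hypothetical marker identifying the clip as fabricated

def embed_tag(samples: np.ndarray, tag: bytes = TAG) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(tag, dtype=np.uint8))
    out = samples.copy()
    out[: len(bits)] = (out[: len(bits)] & ~1) | bits  # overwrite the lowest bit
    return out

def read_tag(samples: np.ndarray, length: int = len(TAG)) -> bytes:
    bits = (samples[: length * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()

audio = (np.random.randn(16000) * 1000).astype(np.int16)  # stand-in 1-second clip
marked = embed_tag(audio)
print(read_tag(marked))  # b'SYNTH'
```

A scheme this simple is trivially stripped or destroyed by re-encoding, which is exactly why watermarking alone can’t settle whether a clip is genuine.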

Bryce Goodman, who co-led a workshop on machine deception at the Neural Information Processing Systems conference last December, is concerned about a wider loss of trust that such programs could engender: “I think we’re still at the point of people not necessarily thinking through the implications of what their research or hobby projects have in the long run.”

This article appears in the May 2018 print issue as “Forging Voices and Faces.”
