Forging Voices and Faces: The Dangers of Audio and Video Fabrication

Adobe, Baidu, Google, and others have software that can fabricate convincing video or audio clips of anyone

Photo: Jonathan Knowles/Getty Images

In 1963, before he could give the speech he’d prepared for his trip to Dallas, U.S. president John F. Kennedy was assassinated. In March 2018, a company re-created the speech that Kennedy had intended to give, synthesized from fragments of his own voice.

Technology companies including Google, Baidu, and Adobe have recently funded efforts to fabricate audio or video from samples of speech or fragments of footage. Startups including Voicery and Lyrebird have developed customizable human voices (built from audio recorded by professional voice actors) that can be programmed to say anything. These companies have also released do-it-yourself software that lets you synthesize your own voice (or someone else’s, with their permission) from a 1-minute recording. And open-source tools to build such programs are available on GitHub.

The results are now convincing enough to raise concerns that these tools could fall into the wrong hands. “It’s not unreasonable to think that you could fool a large group of people with the technology in the state that it’s in today,” says Michael Fauscette, chief research officer at the software review site G2 Crowd.

Someone could use a synthesized voice to deceive devices trained to recognize an individual’s “voiceprint,” or generate a fake video to use as blackmail. Fabricating statements by world leaders, or publishing fake videos of CEOs, could create problems much faster than those clips could be debunked.

To synthesize audio or video, experts have primarily focused on two techniques, both of which rely on machine learning: text-to-speech (generating humanlike speech from annotated voice recordings), and style transfer (in which the style of a piece of content, such as the 1889 painting The Starry Night by Vincent van Gogh, is imposed onto photos or videos).

Mikel Rodriguez, a machine-learning researcher at the Mitre Corp., says the algorithms used to fabricate videos are a twist on techniques that have long been used to classify images. Such programs rely on convolutional neural networks, in which artificial neurons learn to use a numerical matrix, called a filter, to assign values to pixels in an image.
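As a rough illustration of what such a filter does, the minimal sketch below slides a small numerical matrix across a toy grayscale image and records one value per position. The function name `convolve2d`, the 4-by-4 image, and the hand-picked edge filter are all assumptions made for this example. In a trained network the filter values are learned from data rather than chosen by hand, and the operation shown is the sliding dot product that most deep-learning libraries call convolution.

```python
def convolve2d(image, kernel):
    """Slide the kernel over every position where it fits fully inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each pixel under the kernel by the matching weight and sum.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 grayscale image: dark left half, bright right half (illustrative values).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked filter that responds where brightness rises from left to right.
# A real network would learn these weights during training.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The large values in the middle column of the output mark the boundary between the dark and bright halves, which is the kind of per-pixel signal later layers build on.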

These programs have traditionally used those values to draw a conclusion about an image’s content: to say, for example, whether a photo shows a dog or doesn’t show a dog. In the new versions, Rodriguez explains, “Essentially, instead of saying, ‘Give me the answer,’ you’re saying, ‘Give me the pixel.’ ”
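The distinction Rodriguez draws can be sketched in a few lines. This toy example assumes a network has already reduced its input to a short list of feature values, then shows two different final stages: one that collapses the features into a single probability, and one that emits an intensity for every output pixel. The function names, weights, and feature values are hypothetical and far simpler than anything in a real model.

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0..1."""
    return 1 / (1 + math.exp(-x))


def classify(features, weights, bias):
    """'Give me the answer': combine all features into one probability."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(score)


def generate(features, pixel_weights):
    """'Give me the pixel': compute one intensity (0..1) per output pixel."""
    return [
        sigmoid(sum(f * w for f, w in zip(features, weights)))
        for weights in pixel_weights
    ]


# Hypothetical feature values produced by earlier layers.
features = [0.5, -1.0, 2.0]

# Classification head: a single number, read as "probability this is a dog".
prob_dog = classify(features, [1.0, 0.5, 0.25], bias=0.0)

# Generative head: one weight row per output pixel (four pixels here).
pixel_weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
pixels = generate(features, pixel_weights)

print(f"classifier output: {prob_dog:.3f}")
print("generator output:", [round(p, 3) for p in pixels])
```

Both heads read the same features. Only the shape of the output changes, from one value to many.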

Such systems are rapidly improving. In December 2017, Google researchers published a paper describing Tacotron 2, a text-to-speech (TTS) system based on neural networks that could generate speech that sounds as natural to listeners as recordings of people would. In February, Baidu described Deep Voice 3, a TTS system that can be trained much faster than the original version of Tacotron. A month later, Google published two more papers devoted to improving Tacotron’s ability to convey humanlike expressiveness (such as intonation, stress, and rhythm) to match the content of its synthesized speech.

Companies believe the tools needed to synthesize voices or videos could become valuable products. CereProc, the company that synthesized Kennedy’s speech, has created more than 100 custom voices for people who have lost their own due to illness. In a statement about its beta TTS project, called VoCo, Adobe said podcast producers or advertisers could use it to make last-minute edits to a show or voice-over.

To avoid misunderstandings, creators could embed a digital watermark into any synthesized media they produce. But there’s no guarantee that everyone will follow the same rules. And there’s no good way to independently tell whether a video or audio recording has been falsified. “Right now, there’s no tool that works all the time,” Rodriguez says.
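The article does not say how such a watermark would be built. As one deliberately simple possibility, the sketch below hides a short bit pattern in the least-significant bits of integer audio samples, a textbook technique chosen here only for illustration. The function names and sample values are hypothetical. A mark like this is fragile, since compression or resampling would likely erase it, which hints at why reliable marking and detection remain difficult.

```python
def embed_watermark(samples, bits):
    """Overwrite the least-significant bit of the first len(bits) samples."""
    if len(bits) > len(samples):
        raise ValueError("not enough samples to carry the watermark")
    marked = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the watermark bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_watermark(samples, n_bits):
    """Read back the lowest bit of the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]


# Hypothetical 16-bit PCM samples and a 4-bit tag meaning "synthesized".
samples = [1000, -2001, 345, 12, -7, 8192]
tag = [1, 0, 1, 1]

marked = embed_watermark(samples, tag)
recovered = extract_watermark(marked, len(tag))

print("marked samples:", marked)
print("recovered tag: ", recovered)
```

Each marked sample differs from the original by at most one unit, which is inaudible at 16-bit resolution. Anyone who knows the scheme can read the tag, and anyone who wants to remove it can do so just as easily.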

Bryce Goodman, who co-led a workshop on machine deception at the Neural Information Processing Systems conference in December 2017, is concerned about a wider loss of trust that such programs could engender: “I think we’re still at the point of people not necessarily thinking through the implications of what their research or hobby projects have in the long run.”

This article appears in the May 2018 print issue as “Forging Voices and Faces.”
