Forging Voices and Faces: The Dangers of Audio and Video Fabrication

In 1963, before he could give the speech he’d prepared for his trip to Dallas, U.S. president John F. Kennedy was assassinated. In March 2018, a company re-created the speech that Kennedy had intended to give, synthesized from fragments of his own voice.

Technology companies including Google, Baidu, and Adobe have recently funded efforts to fabricate audio or video from samples of speech or fragments of footage. Startups including Voicery and Lyrebird have developed customizable human voices (built from audio recorded by professional voice actors) that can be programmed to say anything. These companies have also released do-it-yourself software that lets you synthesize your own voice (or someone else’s, with their permission) from a 1-minute recording. And open-source tools for building such programs are available on GitHub.

The results are now convincing enough to raise concerns that these tools could fall into the wrong hands. “It’s not unreasonable to think that you could fool a large group of people with the technology in the state that it’s in today,” says Michael Fauscette, chief research officer at the software review site G2 Crowd.

Someone could use a synthesized voice to deceive devices trained to recognize an individual’s “voiceprint,” or generate a fake video to use as blackmail. Fabricating statements by world leaders, or publishing fake videos of CEOs, could create problems much faster than those clips could be debunked.

To synthesize audio or video, experts have primarily focused on two techniques, both of which rely on machine learning: text-to-speech (generating humanlike speech from annotated voice recordings), and style transfer (in which the style of a piece of content, such as the 1889 painting The Starry Night by Vincent van Gogh, is imposed onto photos or videos).

Mikel Rodriguez, a machine-learning researcher at the Mitre Corp., says the algorithms used to fabricate videos are a twist on techniques that have long been used to classify images. Such programs rely on convolutional neural networks, in which artificial neurons learn to use a numerical matrix, called a filter, to assign values to pixels in an image.
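To make the idea of a filter concrete, here is a minimal sketch in plain Python of a small numerical matrix being slid across an image to assign a value to each position. The image, the filter weights, and the function name `apply_filter` are illustrative assumptions, not taken from any system mentioned in this article. In a real convolutional network the filter values are learned from data rather than chosen by hand.

```python
def apply_filter(image, kernel):
    """Slide `kernel` over `image` and return a grid of weighted sums.

    Only positions where the kernel fits entirely inside the image are
    used, so the output is smaller than the input. This is the sliding
    dot product that convolutional layers compute.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked 2x2 filter that responds to a dark-to-bright vertical edge.
# A trained network would learn these numbers instead.
edge_filter = [
    [-1, 1],
    [-1, 1],
]

feature_map = apply_filter(image, edge_filter)
for row in feature_map:
    print(row)
```

The middle column of the result is large because that is where the filter straddles the edge. The flat regions on either side produce zeros.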

These programs have traditionally used those values to draw a conclusion about an image’s content—to say, for example, whether a photo shows a dog or doesn’t show a dog. In the new versions, Rodriguez explains, “Essentially, instead of saying, ‘Give me the answer,’ you’re saying, ‘Give me the pixel.’ ”
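The contrast Rodriguez draws can be illustrated with a toy sketch. It assumes a small grid of filter responses is already available. The functions `classify` and `render`, the threshold, and the sample numbers are all hypothetical and stand in for what would be many learned layers in a real network. The point is only the shape of the output: one answer versus one value per pixel.

```python
def classify(feature_map, threshold=5.0):
    """'Give me the answer': collapse the whole grid into a single yes/no."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values) > threshold


def render(feature_map):
    """'Give me the pixel': keep the grid and emit one 0-255 value per cell."""
    return [[max(0, min(255, round(v))) for v in row] for row in feature_map]


# Made-up filter responses for a 3x3 region of an image.
features = [
    [0.0, 18.0, 0.0],
    [0.0, 18.0, 0.0],
    [0.0, 300.0, -4.0],
]

label = classify(features)   # a single True/False
pixels = render(features)    # a 3x3 grid of displayable intensities

print(label)
for row in pixels:
    print(row)
```

Both functions read the same numbers. The classifier discards the spatial layout to reach a verdict, while the generator preserves it and clamps each value into a displayable range.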

Such systems are rapidly improving. In December 2017, Google researchers published a paper describing Tacotron 2, a text-to-speech (TTS) system based on neural networks that could generate speech that sounds as natural to listeners as recordings of real people. In February 2018, Baidu described Deep Voice 3, a TTS system that can be trained much faster than the original version of Tacotron. A month later, Google published two more papers devoted to improving Tacotron’s ability to convey humanlike expressiveness (such as intonation, stress, and rhythm) to match the content of its synthesized speech.

Companies believe the tools needed to synthesize voices or videos could become valuable products. CereProc, the company that synthesized Kennedy’s speech, has created more than 100 custom voices for people who have lost their own due to illness. In a statement about its beta TTS project, called VoCo, Adobe said podcast producers or advertisers could use it to make last-minute edits to a show or voice-over.

To avoid misunderstandings, creators could embed a digital watermark into any synthesized media they produce. But there’s no guarantee that everyone will follow the same rules. And there’s no good way to independently tell whether a video or audio recording has been falsified. “Right now, there’s no tool that works all the time,” Rodriguez says.
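As a rough illustration of what embedding a watermark could involve, the sketch below hides a short bit pattern in the least significant bit of integer audio samples and reads it back. This is one simple, well-known scheme chosen for illustration. It is an assumption, not the method any company named here uses, and it would not survive lossy compression or re-recording. The function names, the sample values, and the tag are hypothetical.

```python
def embed_watermark(samples, bits):
    """Return a copy of `samples` with `bits` written into the lowest bit
    of the first len(bits) samples. Each sample changes by at most 1."""
    if len(bits) > len(samples):
        raise ValueError("not enough samples to hold the watermark")
    marked = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the watermark bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_watermark(samples, length):
    """Read back the lowest bit of the first `length` samples."""
    return [s & 1 for s in samples[:length]]


# Made-up 16-bit PCM sample values and an 8-bit tag.
audio = [1000, -2001, 37, 4, -5, 12, 255, 256]
tag = [1, 0, 1, 1, 0, 0, 1, 0]

marked_audio = embed_watermark(audio, tag)
recovered = extract_watermark(marked_audio, len(tag))
print(recovered == tag)
```

A scheme like this only identifies synthesized media whose creators chose to mark it, which is the limitation the paragraph above describes.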

Bryce Goodman, who co-led a workshop on machine deception at the Neural Information Processing Systems conference in December 2017, is concerned about a wider loss of trust that such programs could engender: “I think we’re still at the point of people not necessarily thinking through the implications of what their research or hobby projects have in the long run.”

This article appears in the May 2018 print issue as “Forging Voices and Faces.”

Adobe, Baidu, Google, and others have software that can fabricate convincing video or audio clips of anyone
