New Techniques Emerge to Stop Audio Deepfakes

A recent FTC challenge crowned three ways to thwart nefarious voice clones

[Illustration: Moor Studios/Getty Images]

Voice cloning—in which AI is used to create fake yet realistic-sounding speech—has its benefits, such as generating synthetic voices for people with speech impairments. But the technology also has plenty of malicious uses: Scammers can use AI to clone voices to impersonate someone and swindle individuals or companies out of millions of dollars. Voice cloning can also be used to generate audio deepfakes that spread election disinformation.

To combat the increasing dangers posed by audio deepfakes, the U.S. Federal Trade Commission (FTC) launched its Voice Cloning Challenge. Contestants from both academia and industry were tasked with developing ideas to prevent, monitor, and evaluate voice cloning used for nefarious purposes. The agency announced the contest’s three winners in April. These three teams all approached the problem differently, demonstrating that a multipronged, multidisciplinary approach is required to address the challenging and evolving harms posed by audio deepfakes.

3 Ways to Tackle Audio Deepfakes

One of the winning entries, OriginStory, aims to validate a voice at the source. “We’ve developed a new kind of microphone that verifies the humanness of recorded speech the moment it’s created,” says Visar Berisha, a professor of electrical engineering at Arizona State University who leads the development team along with fellow ASU faculty members Daniel Bliss and Julie Liss.

[Photo: Visar Berisha records his voice using an OriginStory microphone. Credit: Visar Berisha]

OriginStory’s custom microphone records acoustic signals just as a conventional microphone does, but it also has built-in sensors to detect and measure biosignals that the body emits as a person speaks, such as heartbeats, lung movements, vocal-cord vibrations, and the movement of the lips, jaw, and tongue. “This verification is attached to the audio as a watermark during the recording process and provides listeners with verifiable information that the speech was human-generated,” Berisha says.
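The article doesn't specify how OriginStory's watermark is encoded, but the idea of binding a "human-generated" claim to audio at capture time can be sketched with a keyed message-authentication tag. In this illustration, `DEVICE_KEY`, the claim format, and the use of HMAC are all assumptions for the sketch; a production system would likely rely on hardware-backed keys and digital signatures so listeners can verify without sharing a secret.

```python
import hashlib
import hmac
import json

DEVICE_KEY = b"per-microphone-secret"  # hypothetical key provisioned in the microphone


def attach_watermark(audio: bytes, biosignals_detected: bool) -> dict:
    """Bundle the recording with a keyed tag asserting humanness at capture time."""
    claims = json.dumps({"human": biosignals_detected}).encode()
    tag = hmac.new(DEVICE_KEY, audio + claims, hashlib.sha256).hexdigest()
    return {"audio": audio, "claims": claims, "tag": tag}


def verify_watermark(record: dict) -> bool:
    """A listener recomputes the tag; any edit to audio or claims breaks it."""
    expected = hmac.new(
        DEVICE_KEY, record["audio"] + record["claims"], hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, record["tag"])
```

Because the tag covers both the audio bytes and the humanness claim, stripping the claim or splicing the audio invalidates the watermark.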

Another winning entry, the aptly named AI Detect, intends to use AI to catch AI. Proposed by OmniSpeech, a company that makes AI-powered speech-processing software, AI Detect would embed machine learning algorithms into devices like phones and earbuds that have limited compute power to distinguish AI-generated voices in real time. “Our goal is to have some sort of identifier when you’re talking on your phone or using a headset, for example, that the entity on the other end may not be a real voice,” says OmniSpeech CEO David Przygoda.
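AI Detect's model details aren't public, but running a detector on a phone or earbud implies cheap per-frame features feeding a very small classifier. The features and weights below are hypothetical placeholders chosen only to show the shape of such a pipeline; a real detector would learn its parameters offline from labeled human and synthetic speech.

```python
import math

# Hypothetical weights; a real model would be trained offline on labeled audio.
WEIGHTS = {"zcr": 4.0, "energy": -2.0, "bias": -1.0}


def frame_features(frame):
    """Cheap features a resource-constrained device can compute per frame."""
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    zcr = zero_crossings / max(len(frame) - 1, 1)
    energy = sum(x * x for x in frame) / len(frame)
    return zcr, energy


def synthetic_voice_score(frame):
    """Logistic score in [0, 1]; higher suggests AI-generated speech."""
    zcr, energy = frame_features(frame)
    z = WEIGHTS["bias"] + WEIGHTS["zcr"] * zcr + WEIGHTS["energy"] * energy
    return 1.0 / (1.0 + math.exp(-z))
```

Keeping the model this small is what makes real-time, on-device scoring plausible on hardware with limited compute.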

The final winning entry, DeFake, is another AI tool. DeFake adds tiny perturbations to a human voice recording, making precise cloning more difficult. “You can think about the perturbations as small scrambling noises added to a human-voice recording, which AI uses to learn about the signature of a human voice,” says Ning Zhang, an assistant professor of computer science and engineering at Washington University in St. Louis. “Therefore, when AI tries to learn from the recorded sample of that speech, it would make a mistake and learn something else.”

Zhang says DeFake is an example of what’s called adversarial AI, a defensive technique that attacks the ability of an AI model to work properly. “We are embedding small snippets of attacks to attack the AI of the attackers—the people trying to steal our voices,” he adds.
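One well-known adversarial-AI technique that fits Zhang's description is the fast gradient sign method (FGSM), which nudges each sample along the sign of a surrogate model's loss gradient. DeFake's actual perturbation scheme is not detailed in the article, so the sketch below is only a generic illustration; `EPSILON` and the availability of a surrogate gradient are assumptions.

```python
EPSILON = 0.002  # perturbation magnitude, kept far below audibility


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def perturb(samples, surrogate_grad):
    """One FGSM-style step: shift each audio sample along the sign of a
    surrogate cloning model's loss gradient, so a cloner trained on the
    result learns a distorted voice signature."""
    return [s + EPSILON * sign(g) for s, g in zip(samples, surrogate_grad)]
```

The perturbation is inaudible to people but, when a cloning model trains on the modified recording, it "learns something else," as Zhang puts it.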

Implementing Audio Deepfake Defenses

Both AI Detect and DeFake are in their early R&D stages. AI Detect is still conceptual, while DeFake needs more efficiency improvements. And Przygoda and Zhang are aware of the drawbacks of using artificial intelligence.

“This is going to take an ongoing effort where we’re updating our datasets and our technology to keep up with the developments in the models and the hardware being used to create deepfakes. This is something that’s going to take active monitoring,” Przygoda says.

Zhang echoes the sentiment: “AI is moving really fast, so we need to constantly make sure to adjust our technique as new capabilities come up. And as the defenders, we don’t know what attackers are using as an AI model, so we need to be able to generically defend against all of the attacks while maintaining the quality of the voice, which makes things a lot harder.”

Meanwhile, OriginStory is in the testing stage and is working to spoof-proof the technology. “We’re running a validation study with lots of different users trying to trick it into thinking that there’s a human behind the microphone when there’s not. At the end of that we’ll have a sense of how robust it is. You need to know with really high certainty that the person on the other end is a human,” Berisha says.


According to Nauman Dawalatabad, a postdoctoral associate in the Spoken Language Systems group at MIT’s Computer Science and Artificial Intelligence Laboratory, AI Detect’s approach is promising. “It’s crucial for a fake/real audio detection model to operate on-device to preserve privacy, rather than sending personal data to a company’s server.”

Meanwhile, Dawalatabad views DeFake’s preventive strategy, which he likens to watermarking, as a good solution to protect consumers from fraud when their speech data is compromised or intercepted. “However, this approach depends on knowing all the source speakers and requires careful implementation. For instance, simply rerecording a watermarked speech with another microphone device can fully or partially remove the effects of a watermark,” he adds.

As for OriginStory, Dawalatabad says that the technology’s similar preventive method of stamping at the source “seems more robust than software-based watermarking alone, as it relies on biosignals that are difficult to replicate.”

But Dawalatabad notes that an even more effective tactic to tackle the problem of audio deepfakes is a four-pronged approach that combines multiple strategies. The first step, he says, is to watermark new audio recordings now to make them traceable. The second step is what the winning entries are embarking on—developing better detection models, which are “crucial for securing current data, much of which is not watermarked,” he says.

The third step involves deploying detection models directly on devices to enhance security and preserve privacy. “This includes coming up with better model compression algorithms to deploy on resource-constrained devices,” says Dawalatabad. “Also, I suggest adding these detection models at the system level by the manufacturers themselves.”

Finally, Dawalatabad emphasizes the need to “engage policymakers to ensure consumer protection while promoting solutions wherever possible.”

The three winners of the FTC’s Voice Cloning Challenge will share a total cash prize of US $35,000. A fourth solution, from information security company Pindrop, received a recognition award. It detects audio deepfakes in real time by analyzing speech in 2-second intervals and flagging segments identified as potentially suspicious.
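Pindrop's model itself is proprietary, but the windowing step it describes, scoring speech in 2-second intervals, is straightforward to sketch. The sample rate and the pluggable `score_fn` below are assumptions for illustration.

```python
SAMPLE_RATE = 16_000       # assumed sampling rate in Hz
WINDOW = 2 * SAMPLE_RATE   # 2 seconds of samples per analysis window


def flag_suspicious_windows(samples, score_fn, threshold=0.5):
    """Score each 2-second chunk and return (start_time_s, flagged) pairs."""
    results = []
    for start in range(0, len(samples), WINDOW):
        chunk = samples[start:start + WINDOW]
        results.append((start / SAMPLE_RATE, score_fn(chunk) >= threshold))
    return results
```

Short fixed windows let a detector flag suspicious spans mid-call instead of waiting for a full recording.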