Automatic Speaker Verification Systems Can Be Fooled by Disguising Your Voice

Such systems are used to build evidence in criminal cases and to grant access to personal information


Automatic speaker verification (ASV) systems are sometimes used to grant access to sensitive information and identify suspects in a court of law. Increasingly, they are being baked into consumer devices, such as Amazon Echo and Google Home, to respond to person-specific commands, such as “play my music” or “read my email.”

But such systems make mistakes when speakers disguise their voices to sound older or younger, according to a new study published in Speech Communication by researchers from the University of Eastern Finland. Earlier research by the same group has shown that some ASV systems can’t distinguish between a professional impersonator and the person they are imitating.  

It’s hard to tell how similar the systems tested for these studies are to commercial technologies, but Tomi Kinnunen, a coauthor and computer scientist at the University of Eastern Finland, says they’re probably not too far off. “There are many variants of how this is implemented in practice, but pretty much, they are still based on a lot of machine learning and signal processing,” he says.

Specifically, the researchers found that the equal error rate of an ASV system—the operating point at which false rejections (mistaking the true speaker for someone else) and false acceptances (tagging a different speaker as the same person) occur equally often—increased 11-fold for male speakers and sixfold for female speakers who tried to sound younger than they were. When speakers tried to sound older, the system’s equal error rate increased sevenfold for males and fivefold for females.
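To illustrate the metric: an equal error rate can be found by sweeping a decision threshold across verification scores until the false-rejection and false-acceptance rates meet. The sketch below uses invented scores and a naive threshold sweep for illustration only; it is not the researchers' system or scoring method.

```python
# Minimal sketch of computing an equal error rate (EER) from
# verification scores. Scores here are invented for illustration.

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep a decision threshold and return the error rate at the
    point where the false-rejection rate (true speaker rejected) and
    false-acceptance rate (impostor accepted) are closest."""
    thresholds = sorted(set(genuine_scores + impostor_scores))
    best = None
    for t in thresholds:
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if best is None or abs(frr - far) < abs(best[0] - best[1]):
            best = (frr, far)
    return (best[0] + best[1]) / 2

genuine = [0.9, 0.8, 0.75, 0.6, 0.55]   # same-speaker trial scores
impostor = [0.7, 0.5, 0.4, 0.3, 0.2]    # different-speaker trial scores
print(equal_error_rate(genuine, impostor))  # → 0.2
```

An 11-fold increase in this single number means the system is making far more of both kinds of mistake once speakers disguise their voices.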

This means that people can fool ASV systems by changing the sound of their own voice. Speaking at a higher frequency, which most speakers did to produce a youthful voice, proved a more effective disguise than imitating an older person’s voice. For the latter, speakers also raised the frequency of their own voice, but not as high as for the younger version.  

Rosa González Hautamäki, a coauthor who defended her dissertation on this topic last Thursday, says ASV systems can’t easily detect changes to the fundamental (or lowest) frequency of a person’s voice. The fundamental frequency can rise and fall as someone speaks, but ASV systems mistake these changes for a new speaker. “Including this feature would make these systems more robust to these kinds of challenges,” she says.  
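As a rough illustration of the feature Hautamäki describes, the fundamental frequency of a voiced frame can be estimated by finding the peak of its autocorrelation. The simplified estimator below runs on a synthetic 125 Hz tone; it is a hypothetical sketch, not the front end of any real ASV system, which would use far more robust pitch tracking.

```python
# Hypothetical sketch: estimate fundamental frequency (F0) of a short
# frame by autocorrelation, searching lags in a plausible pitch range.
import math

def estimate_f0(samples, sample_rate, fmin=50, fmax=500):
    """Return, in Hz, the lag that maximizes the autocorrelation
    within the [fmin, fmax] pitch range."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i - lag]
                   for i in range(lag, len(samples)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag

sr = 8000
tone = [math.sin(2 * math.pi * 125 * n / sr) for n in range(2048)]  # 125 Hz "voice"
print(round(estimate_f0(tone, sr)))  # → 125
```

A speaker deliberately raising this frequency to sound younger shifts exactly the kind of cue that, per Hautamäki, current ASV systems do not model well.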

Historically, Kinnunen says, automatic speaker verification systems have shown higher error rates for female speakers (who typically speak at a higher fundamental frequency) than for male speakers. In their test, the opposite was true—perhaps because both male and female speakers attempted to raise the frequencies of their voices to sound younger and older, and female speakers were already starting at a higher frequency.

When combined with other authentication methods, a “voiceprint” may be a useful identifier. But Ben Fisher, CEO at Magic & Co., a technology consultancy in New York City, says it should generally not be trusted on its own. In addition to being susceptible to voice mimicry and disguise, voiceprints can be thwarted by simply recording someone saying a phrase (such as “open”) and replaying that recording.

Increasingly, software has made it easier to synthesize entire sentences and conversations in someone’s voice. Fisher points to Lyrebird, an AI-based service that can make a digital copy of a voice based on a 60-second clip, and use that copy to produce new sentences that the original speaker never uttered. Google’s DeepMind is working on a similar project called WaveNet.

Fisher says fraud detection software can identify synthesized and recorded voices with relatively high accuracy—about 90 to 95 percent. But that’s still not good enough to trust automatic speaker verification systems to grant access to bank accounts and secure areas, or for other sensitive applications. And fraud detection software often requires advanced algorithms and processors not available on consumer devices, where voice is quickly becoming the primary interface.

There are also biological reasons why voice is an imperfect mode of identification—people’s voices change as they age, and even when they’re sick. These challenges have so far limited the utility of voice as a biometric. “It’s a real concern,” Fisher says. “One of the reasons you haven’t seen it everywhere is because of this problem.”

Realistically, a professional impersonator is unlikely to walk into someone’s home and imitate their voice in order to listen to their favorite Amazon Music playlist. And Kinnunen and Hautamäki say they don’t know of any case in which a criminal or impersonator has used voice disguise or mimicry to fool an ASV system for nefarious purposes.

However, Fisher believes the risk of voice-enabled hacking will worsen over time. “I think this will be a larger problem,” he says. “The ability to make a threat out of it is growing faster than the defenses.”
