“Hey, Siri, add proctologist appointment to my calendar for 12:50pm tomorrow.”
“Hey, Siri, play Wind Beneath My Wings by Bette Midler on repeat.”
“Hey, Siri, what’s 52 + 25?”
“Hey, Siri, Google me.”
There are lots of digital assistant commands one should never say out loud in front of other people. But in a few years, proximity to eavesdroppers might not be a problem anymore, because we’ll be able to silently mouth all that embarrassing dictation.
Called “silent speech recognition,” the technology essentially reads lips. Users mouth the words of a command or message, and the device deciphers the words based on the movements of the face and neck.
The feat can be accomplished using surface electromyography, or sEMG. Electrodes are placed on the skin at key places around the mouth, along the jaw, under the chin or on the neck. Muscle movements in those areas generate neuromuscular signals—essentially an electrical code. Algorithms trained on silent speech then translate the electrical signals, decoding what the user is saying.
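For readers who want to see the idea in miniature, here is a rough Python sketch of that pipeline: multichannel signals go in, a per-channel feature comes out, and a trained model picks the closest word. Everything in it is an assumption made for illustration, not any research group's actual method. The recordings are synthetic noise, the two-word vocabulary and the `PROFILES` activation levels are invented, and the nearest-centroid classifier stands in for whatever algorithms real systems use.

```python
import math
import random


def rms_features(window):
    """Root-mean-square amplitude per channel.

    window: list of channels, each a list of float samples.
    """
    return [math.sqrt(sum(s * s for s in ch) / len(ch)) for ch in window]


def train_centroids(examples):
    """Average the feature vectors of each word's training recordings.

    examples: list of (word, window) pairs. Returns {word: mean feature vector}.
    """
    sums, counts = {}, {}
    for word, window in examples:
        feats = rms_features(window)
        if word not in sums:
            sums[word] = [0.0] * len(feats)
            counts[word] = 0
        sums[word] = [a + b for a, b in zip(sums[word], feats)]
        counts[word] += 1
    return {w: [v / counts[w] for v in sums[w]] for w in sums}


def decode(window, centroids):
    """Return the word whose centroid is nearest to this window's features."""
    feats = rms_features(window)
    return min(centroids, key=lambda w: math.dist(feats, centroids[w]))


def fake_window(gains, rng, n_samples=200):
    """Synthetic stand-in for a recording: Gaussian noise scaled per channel."""
    return [[rng.gauss(0.0, g) for _ in range(n_samples)] for g in gains]


# Hypothetical activation level of three electrodes for each of two words.
PROFILES = {"yes": [1.0, 0.2, 0.6], "no": [0.3, 1.0, 0.2]}

rng = random.Random(0)
training = [(w, fake_window(g, rng)) for w, g in PROFILES.items() for _ in range(20)]
centroids = train_centroids(training)

# Decode one fresh, unlabeled recording.
print(decode(fake_window(PROFILES["yes"], rng), centroids))
```

Real facial sEMG is far messier than this toy data, which is part of why vocabulary size and per-user calibration remain hard problems.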
Last week, among the sea of abstracts at the IEEE International Conference on Micro Electro Mechanical Systems (MEMS) in Seoul, researchers from Soochow University presented an advance in this quiet corner of bioengineering: a super-flexible electrode patch that adheres to the sharp curvature of the jaw and is capable of decoding a few silent words with about 72-percent accuracy.
A flexible patch offers a simple, ergonomic way to attach the device to the face quickly—something many previous designs lack. “I’ll be curious when that comes out to see what they’ve done,” says Geoff Meltzner, vice president of research and technology at VocalID, a synthetic voice company, who was not involved in the Soochow project. “I think this could solve an ergonomics or ease-of-use problem,” he says.
But a flexible material addresses just one piece of the challenge. A handful of other groups have, since NASA’s pioneering work 15 years ago, been quietly building the nuts and bolts of a silent speech recognition device. One of them is Delsys, a wearable sensor maker, which has been working on the technology since about 2006, says Meltzner, who was the principal investigator in the early years of the project.
Delsys’ current prototype has eight custom, rigid electrodes that are placed on the face and neck. They are connected to a computer where algorithms perform the decoding. The system can recognize about 2200 mouthed words—about 5 percent of the 42,000 words in the average English speaker’s vocabulary. (And yes, this writer told Siri to do that math while no one was within earshot.)
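For anyone without a digital assistant handy, the same math can be checked in a couple of lines of Python. Both figures come from the paragraph above; the variable names are just illustrative.

```python
recognized_words = 2200    # words the Delsys prototype can recognize
vocabulary_size = 42_000   # words in the average English speaker's vocabulary

coverage = recognized_words / vocabulary_size
print(f"{coverage:.1%}")   # prints 5.2%
```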
In a study published in June 2018, Delsys’ device proved to be about 91 percent accurate in translating silent speech. One can imagine the potential applications: any phone user who doesn’t want to be heard, people working in loud environments, and military personnel who need hands-free covert communication.
The tool not only saves us from divulging private information, but could also make us sound smart. Imagine chatting with a group of people, and being able to secretly Google something, or do complicated math, or search for the name of an acquaintance you should totally already know, just in time to introduce her. Smooth.
But no one is going to be fooled by someone mouthing the words “Hey Siri, what is Sylvia’s last name?” while wearing a headset that wraps from his ear to his chin.
To address that problem, researchers at the MIT Media Lab are developing a covert communication device they call AlterEgo. The current prototype is bulky, but it can decipher words articulated silently with the mouth closed. “We call it internal articulation,” says Arnav Kapur, a graduate student at MIT who is heading up the project.
To communicate with the device, the user makes slight internal movements with the mouth closed, kind of like the way one might form words while silently reading, says Kapur.
The electrical signals generated by such movements are much weaker than those generated by mouthing the words, because they come from deeper within the body. That makes them harder for surface electrodes to pick up and decode than the signals Delsys’ system reads.
AlterEgo can recognize about 100 internally articulated words with about 92 percent accuracy, Kapur says. He and his team have also linked up the device with bone conduction headphones so that the computer’s responses get piped into the user’s ear.
This creates an insulated, closed-loop system between the user and the computer. Eavesdroppers can hear neither the user’s dictation nor the computer’s response. “You feel like you have an AI agent in your head that you can interact with,” says Kapur.
Both MIT’s and Delsys’ prototypes have a ways to go before they’re ready for commercialization. Both must increase the number of words that their algorithms can recognize, and that has to be done from scratch. “Because ours is a new modality, we don’t have a publicly available dataset [of internally articulated speech], so we’re conducting our own experiments and recording our own data,” says Kapur.
Each device must be calibrated for the specific person who will be using it, to account for differences in dialect and electrophysiology. For Delsys’ device, that takes a few hours per person, says Meltzner.
But perhaps clunky hardware and hours of calibration are a small price to pay for those with a medical need. Indeed, one of Delsys’ first applications may be giving voice to people who don’t have one.
In a study of eight people whose voice boxes had been surgically removed due to trauma or disease, the Delsys device accurately translated their silently articulated words nearly 90 percent of the time. Delsys scientists published their results in December in IEEE/ACM Transactions on Audio, Speech and Language Processing.
Delsys is partnering with Meltzner’s company, VocalID, to turn silent speech-to-text back to speech for those patients. A person who knows he will be having a laryngectomy could go to VocalID to “bank” his voice—a way to synthetically save it. Later, after his surgery, the patient could not only have his silent speech translated to text, but also re-vocalized, using the recordings of his voice.