Nothing spoils a Zoom meeting quite like that one team member who insists on dialing in from a noisy cafe. New AI-powered wireless headphones called ClearBuds promise a solution by blocking out background noise and ensuring their microphones pick up only the caller’s voice.
Speech-enhancement technology is already used in a variety of products, including hearing aids, teleconferencing services like Zoom and Google Meet, and wireless headphones like Apple’s AirPods Pro. The goal is to strip incoming audio of unwanted noise or distortions and boost the clarity of the speaker’s voice, either using traditional signal-processing algorithms or newer machine-learning approaches.
These systems work by exploiting spatial cues that help separate out audio sources, or acoustic information that can distinguish among different kinds of sound, such as speech or traffic. But doing both at the same time, with a computational budget small enough to run on consumer-grade devices, is a significant challenge, and most real-world systems still leave a lot to be desired.
Using a clever combination of tailor-made in-ear wireless headphones, a bespoke Bluetooth protocol and a lightweight deep-learning model that can run on a smartphone, a team from the University of Washington has built a system dubbed ClearBuds that almost completely cuts out background noises.
“For us, ClearBuds were born out of necessity,” says Ishan Chatterjee, a doctoral student and one of the co-lead authors of a paper describing the technology, presented at the ACM International Conference on Mobile Systems, Applications, and Services. He is not only a classmate but also a roommate of the other two co-lead authors, doctoral students Maruchi Kim and Vivek Jayaram.
“When the pandemic lockdown started, like many others, we found ourselves taking many calls within our house within these close quarters, and there were many noises around the house, kitchen noises, construction noises, conversations,” says Chatterjee. So they decided to pool their experience in hardware, networking, and machine learning to solve the problem.
One of the biggest challenges for speech enhancement, says Jayaram, is separating out multiple voices. While recent machine-learning approaches have become good at distinguishing different kinds of sounds and using this to block out background noise, they still struggle when two people are talking at the same time.
The best way to solve this problem is to use multiple microphones spaced slightly apart, allowing you to triangulate the source of different noises. This makes it possible to distinguish between two speakers based on where they’re located rather than what they sound like. But for this to be effective, the microphones need to be a reasonable distance apart.
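The core of this spatial trick is estimating the tiny time difference between when a sound reaches each microphone. A minimal sketch of that idea, using cross-correlation (a standard technique; the paper's exact method may differ, and the signals and sample rate here are made up for illustration):

```python
import numpy as np

def estimate_delay(left, right, fs):
    """Return the delay of `right` relative to `left`, in seconds.

    A positive value means the sound reached the left microphone
    first. Cross-correlation peaks at the lag that best aligns the
    two signals, which is the time-difference-of-arrival (TDOA).
    """
    corr = np.correlate(left, right, mode="full")
    # In "full" mode, index len(right) - 1 corresponds to zero lag.
    lag = (len(right) - 1) - np.argmax(corr)
    return lag / fs

fs = 16_000
rng = np.random.default_rng(0)
voice = rng.standard_normal(2_000)  # 125 ms of noise standing in for speech
right = np.roll(voice, 8)           # arrives 8 samples (0.5 ms) later on the right
print(estimate_delay(voice, right, fs))  # 0.0005
```

Given the delay and the known microphone spacing, simple geometry then yields the direction the sound came from, which is how a system can keep a voice that is centered between the earbuds and suppress everything off-axis.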
Most commercial products have microphones in each earbud, which should be far enough apart to get decent triangulation. But streaming and syncing audio from both is beyond today’s Bluetooth standards, says Kim. That’s why Apple’s AirPods and high-end hearing aids have multiple microphones in each earbud, allowing them to do some limited triangulation before streaming from a single earbud down to the connected smartphone.
To get around this, the researchers designed a custom wireless protocol in which one of the earbuds transmits a time-sync beacon. The second earbud then uses this signal to match its internal clock to its partner’s, ensuring that the two audio streams stay in lockstep. The team implemented this protocol on custom-designed earbuds built from commodity electronic components, with 3D-printed enclosures. But syncing up the streams from each earbud solved only part of the problem.
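The beacon scheme boils down to each beacon giving the secondary earbud a fresh estimate of how far its clock has drifted from the primary's. A hypothetical sketch of that bookkeeping (the actual ClearBuds protocol is not spelled out at this level, and the smoothing factor here is an assumption):

```python
from dataclasses import dataclass

@dataclass
class ClockSync:
    """Track the offset of a secondary earbud's clock relative to the
    primary's, updated each time a sync beacon arrives."""
    offset: float = 0.0       # estimated (local clock - primary clock), seconds
    alpha: float = 0.1        # smoothing factor for successive beacons (assumed)
    initialized: bool = False

    def on_beacon(self, primary_tx_time, local_rx_time):
        # Radio propagation over ~20 cm of air is well under a
        # microsecond, so it is ignored here.
        raw = local_rx_time - primary_tx_time
        if not self.initialized:
            self.offset, self.initialized = raw, True
        else:
            # Exponential moving average damps jitter in beacon timing.
            self.offset += self.alpha * (raw - self.offset)

    def to_primary_time(self, local_time):
        """Map a local timestamp onto the primary earbud's timeline."""
        return local_time - self.offset

sync = ClockSync()
sync.on_beacon(primary_tx_time=100.0, local_rx_time=100.5)
print(sync.to_primary_time(101.5))  # 101.0
```

Once every audio sample from the secondary earbud can be stamped on the primary's timeline, the two streams can be interleaved sample-accurately on the phone, which is the precondition for the triangulation described above.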
The researchers wanted to take advantage of the latest deep-learning techniques to process the audio, but they also needed to run their speech-enhancement software on the smartphone paired with the earbuds. These models have significant computational demands, and most commercial products using AI for speech enhancement rely on transmitting the audio to powerful cloud servers. “A mobile phone, even the newer ones, is a fraction of the compute power of the GPU cards, which are typically used for running deep learning,” says Jayaram.
Their solution was to take a preexisting neural network that can learn to detect time differences between two incoming signals, allowing it to triangulate the source of a sound. They then trimmed this down to its bare bones by reducing the number of parameters and layers until it could run on a smartphone. Stripping back the network like this led to a noticeable drop in audio quality, introducing crackles, static, and pops, so the researchers fed the output into a second network that learns to filter these kinds of distortions out.
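The shape of this pipeline is a cascade: an aggressive first stage that isolates the voice, followed by a second stage that cleans up the artifacts the first one introduces. A toy illustration of that structure (the stand-in functions below are deliberate simplifications, not the paper's networks: a channel average in place of the spatial network, and a moving average in place of the artifact-removal network):

```python
import numpy as np

def stage_one_separate(left, right):
    """Toy stand-in for the first network: reinforce the component
    common to both (time-aligned) channels -- the wearer's voice --
    while off-axis noise partially cancels."""
    return 0.5 * (left + right)

def stage_two_cleanup(audio, kernel=5):
    """Toy stand-in for the second network: smooth out the crackles
    an aggressive first stage can introduce. ClearBuds uses a second
    lightweight neural network here, not a fixed filter."""
    window = np.ones(kernel) / kernel
    return np.convolve(audio, window, mode="same")

def enhance(left, right):
    # The two stages run back to back, so each can stay small enough
    # for a phone while the cascade does the work of one big model.
    return stage_two_cleanup(stage_one_separate(left, right))
```

The design point the quote below makes is exactly this division of labor: neither stage alone needs the capacity of a single end-to-end model, because each solves a narrower problem.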
“The innovation was combining two different types of neural networks, each of which could be very lightweight, and in conjunction they could approach the performance of these really big neural networks that couldn’t run on an iPhone,” says Jayaram.
When tested against Apple AirPods Pro, the ClearBuds achieved a higher signal-to-distortion ratio across all tests. The team also got 37 volunteers to rate audio clips from noisy real-world environments like loud restaurants or busy traffic intersections. The ones processed through ClearBuds’ neural network were found to have the best noise suppression and overall listening experience. In real-world tests, eight volunteers significantly preferred the ClearBuds over the audio equipment they’d normally use to conduct calls.
The output does contain some distortions, says Tillman Weyde, reader in machine learning at City University of London, but they are not especially intrusive and overall the system is very effective at removing background noise and voices. “This is a great result from a student and academic team that has obviously put a tremendous amount of work into this project to make effective progress on a problem that affects hundreds of millions of people using wireless earbuds,” he adds.
Alexandre Défossez, a research scientist at Facebook AI Research Paris, says the work is very impressive but points out that one limitation is the fact that the combined time for transmitting the audio to the smartphone and then processing it is 109 milliseconds. “We always get 50 to 100 milliseconds of latency from the network,” he says. “Adding an extra 100 milliseconds is a big price to pay, and as the communication stack becomes ‘smarter’ and ‘smarter,’ we will end up with fairly noticeable and annoying delays in all our communications.”
Edd Gent is a freelance science and technology writer based in Bangalore, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience. He's on Twitter at @EddytheGent and email at edd dot gent at outlook dot com.