Mitsubishi Electric’s AI Can Follow and Separate Simultaneous Speech

The cocktail party problem refers to the challenge of following a single person’s speech in a room full of surrounding chatter and noise. With a little concentration, humans can focus on what a particular person is saying. But when we want technology to separate the speech of a targeted person from the simultaneous conversations of others, as we do with hands-free telephony when a caller is in a car with kids in the back seat, the results leave much to be desired.

Until now, that is, says Mitsubishi Electric. The company demonstrated its speech separation technology at its annual R&D Open House in Tokyo on 24 May. In one type of demonstration, two people spoke a sentence in different languages simultaneously into a single microphone. The speech separation technology separated the two sentences in real time (about 3 seconds), and then reconstructed and played them back consecutively with impressive accuracy. However, the demonstration took place in a closed room and required silence from all those watching.

A second demonstration used a simulated mix of three speakers. Unsurprisingly, the result was noticeably less accurate.

Mitsubishi Electric claims accuracy of up to 90 percent for two speakers and up to 80 percent for three, under ideal conditions of low ambient noise and speakers talking at about the same volume. The company believes these are the best results achieved to date. By comparison, it says, conventional technology manages only around 50 percent accuracy for two speakers using a single microphone.

The technology uses Deep Clustering, Mitsubishi Electric’s proprietary deep-learning method.

The system has learned how to examine and separate mixed speech data. A deep network encodes the speech signals or elements based on each speaker’s tone, pitch, intonation, etc. The encoded signals are optimized so that different components belonging to the same speaker have similar encodings, while those belonging to another speaker have dissimilar encodings. A clustering algorithm processes the encodings into groups depending on their similarities. Each person’s speech is then reconstructed by synthesizing the separated speech components.
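The encode, cluster, and reconstruct steps described above can be illustrated with a toy sketch. This is not Mitsubishi Electric's implementation: the article does not name the clustering algorithm or the signal representation, so the sketch assumes k-means clustering and binary masking of a magnitude spectrogram, and it substitutes synthetic embeddings for the output of a trained network. The function names `kmeans` and `separate` are hypothetical.

```python
import numpy as np


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from every center chosen so far.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def separate(mixture, embeddings, n_speakers):
    """Split a (T, F) spectrogram into one masked spectrogram per speaker.

    embeddings has shape (T * F, D): one vector per time-frequency bin.
    In a real system these would come from the trained deep network.
    """
    labels = kmeans(embeddings, n_speakers).reshape(mixture.shape)
    # Binary mask: each bin is given entirely to one speaker (an assumption).
    return [mixture * (labels == s) for s in range(n_speakers)]


# Toy data: a checkerboard of bins "owned" by speaker 0 or speaker 1.
rng = np.random.default_rng(0)
T, F = 6, 8
owner = np.add.outer(np.arange(T), np.arange(F)) % 2
mixture = rng.random((T, F)) + 0.1  # strictly positive magnitudes

# Stand-in embeddings: bins of the same speaker sit near the same prototype.
prototypes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
embeddings = prototypes[owner.ravel()] + 0.05 * rng.standard_normal((T * F, 3))

sources = separate(mixture, embeddings, n_speakers=2)
print(np.allclose(sources[0] + sources[1], mixture))
```

Because the masks are binary, the separated spectrograms always add back up to the mixture. A practical system would also need to convert each masked spectrogram back into a waveform, which this sketch leaves out.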

“Unlike separating a speaker from background noise, separating a speaker’s speech from another speaker talking is most difficult, because they have similar characteristics,” says Anthony Vetro, deputy director at Mitsubishi Electric Research Laboratories in Cambridge, Mass. “You can do it to some degree by using more elaborate set-ups of two or more mics to localize the speakers, but it is very difficult with just one mic.”

The beauty of this system, he adds, is that it is not speaker dependent, so no speaker-specific training is involved. Similarly, it is not language dependent.

Yohei Okato, senior manager of Mitsubishi Electric’s Natural Language Processing Technology Group in Kamakura, near Tokyo, says the company will use the technology to improve the quality of voice communications and the accuracy of automatic speech recognition in applications such as controlling automobiles and elevators, as well as in the home to operate various appliances and gadgets. “We will be introducing it in the near future,” he adds.
