Baidu’s AI Can Do Simultaneous Translation Between Any Two Languages

Baidu Research reveals a translation tool that keeps up by predicting the future

Photo-illustration: Shutterstock

Would-be travelers of the galaxy, rejoice: The Chinese tech giant Baidu has invented a translation system that brings us one step closer to a software Babel fish.

For those unfamiliar with the Douglas Adams masterworks of science fiction, let me explain. The Babel fish is a slithery fictional creature that takes up residence in the ear canal of humans, tapping into their neural systems to provide instant translation of any language they hear.

In the real world, until now, we’ve had to make do with human and software interpreters that do their best to keep up. But the new AI-powered tool from Baidu Research, called STACL, could speed things up considerably. Its translation lags only a few words behind the speaker, and it keeps up by predicting the words that are likely to come next.

“What’s remarkable is that it predicts and anticipates the words a speaker is about to say a few seconds in the future,” says Liang Huang, principal scientist of Baidu’s Silicon Valley AI Lab. “That’s a technique that human interpreters use all the time—and it’s critical for real-world applications of interpretation technology.” 

The STACL (Simultaneous Translation with Anticipation and Controllable Latency) tool is comparable to the human interpreters who sit in booths during UN meetings. These humans have a tough job. As a dignitary speaks, the interpreters must simultaneously listen, mentally translate, and speak in another language, usually lagging only a few words behind. It’s such a difficult task that UN interpreters usually work in teams and take shifts of only 10 to 30 minutes.

A task requiring that kind of parallel processing (listening, translating, speaking) seems well suited for computers. But until now, it has been too hard for them as well. The best “real-time” translating systems still do what’s called consecutive translation, in which they wait for each sentence to conclude before rendering its equivalent in another language. These systems provide quite accurate translations, but they’re slow.

Huang tells IEEE Spectrum that the big challenge in simultaneous interpretation comes from word order differences in various languages. “In the UN, there’s a famous joke that an interpreter who’s translating from German to English will pause, and seem to get stuck,” he says. “If you ask why, they say, ‘I’m waiting for the German verb.’” In English, the verb comes early in the sentence, he explains, while in German it can come at the very end of the sentence.

STACL gets around that problem by predicting the verb to come, based on all the sentences it has seen in the past. For their current paper, the Baidu researchers trained STACL on newswire articles, where the same story appeared in multiple languages. As a result, it’s good at making predictions about sentences dealing with international politics.

Huang gives an example of a Chinese sentence, which would be most directly translated as “Xi Jinping French president visit expresses appreciation.” STACL, however, would guess from the beginning of the sentence that the visit would go well, and would translate it into English as “Xi Jinping expresses appreciation for the French president’s visit.”

In that paper, the researchers demonstrated STACL’s capabilities in translating from Chinese to English (two languages with big differences in word order). “In principle, it can work on any language pair,” Huang says. “There’s data on all those other languages. We just haven’t run those experiments yet.”

Clearly, STACL can make mistakes. If the French president’s visit hadn’t gone well, and Xi Jinping instead expressed regret and dismay, the translation would have a glaring error. At the moment, it can’t correct its mistakes. “A human interpreter would apologize, but our current system doesn’t have the capability to revise an error,” Huang says.

However, the system is adjustable, and users will be able to make trade-offs between speed and accuracy. If STACL is programmed to have longer latency—to lag five words behind the original text instead of three words behind—it’s more likely to get things right.
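
To make that trade-off concrete, here is a minimal Python sketch of a fixed-lag schedule. It is an illustration only, not Baidu's code: it assumes lag is counted in whole words, that the system writes one output word for each new input word once the initial lag has passed, and that it finishes the sentence after the input runs out. The names `fixed_lag_schedule` and `read_write_actions` are made up for this example.

```python
from typing import List


def fixed_lag_schedule(source_len: int, target_len: int, lag: int) -> List[int]:
    """Return, for each output word, how many input words have been read
    before that output word is written.

    Assumption: the translator waits for `lag` input words, then alternates
    between writing one output word and reading one more input word.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1 word")
    # Output word t (0-based) may see lag + t input words, capped at the
    # full sentence length once the speaker has finished.
    return [min(lag + t, source_len) for t in range(target_len)]


def read_write_actions(source_len: int, target_len: int, lag: int) -> List[str]:
    """Expand the schedule into an interleaved list of READ and WRITE steps."""
    actions: List[str] = []
    words_read = 0
    for visible in fixed_lag_schedule(source_len, target_len, lag):
        # Catch up on input until the allowed context is available.
        while words_read < visible:
            actions.append("READ")
            words_read += 1
        actions.append("WRITE")
    return actions


if __name__ == "__main__":
    # A six-word input sentence translated into six output words.
    print(fixed_lag_schedule(6, 6, lag=3))
    print(fixed_lag_schedule(6, 6, lag=5))
    print(" ".join(read_write_actions(6, 6, lag=3)))
```

With a lag of three, the first output word is written after only three input words have been heard, so a late verb would have to be guessed. With a lag of five, the first output word sees five of the six input words, which leaves less to predict but makes the listener wait longer.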

It can also be made more accurate by training it in a particular subject, so that it understands the likely sentences that will appear in presentations at, say, a medical conference. “Just like a human simultaneous interpreter, it would need to do some homework before the event,” Huang says.

Huang says STACL will be demoed at a Baidu World conference on 1 November, where it will provide live simultaneous translation of the speeches. The aim is to eventually put this capability into consumers’ pockets. Baidu has previously shown off a prototype consumer device that does sentence-by-sentence translation, and Huang says his team plans to integrate STACL into that gadget. 

Right now, STACL works on text-to-text translation and speech-to-text translation. To make it useful for a consumer device, the researchers want to master speech-to-speech translation. That will require integrating speech synthesis into the system. And when the speech is being synthesized only a few words at a time, without any knowledge of the whole sentence’s structure, it will be a challenge to make it sound natural. 

Huang says his goal is to make instant translation services more readily accessible and affordable to the general public. But he notes that STACL is “not intended to replace human interpreters—especially for high-stakes situations that require precise and consistent translations.” After all, nobody wants an AI to be at the center of an international incident because it erroneously predicts Xi Jinping’s expressions of appreciation or regret.  
