Think you have trouble deciphering social media slang? Try translating it. Microsoft researchers have been studying how to translate social media, and in their efforts they came across a way to teach the company’s upcoming Skype Translator how to speak more like us.
Some researchers think social media could be key to getting computers to better understand humans. Social media experiments are “important examples of a new line of research in computational social science, showing that subtle social meaning can be automatically extracted from speech and text in a complex natural task,” says Dan Jurafsky, an expert in computational linguistics at Stanford, who recently led work on teaching computers about human interaction by analyzing recorded speed-dating conversations.
The Skype Translator app, set for beta release later this year, translates multilingual conversations over the service as they’re happening. In May, Gurdeep Singh Pall, corporate vice president of Skype and Lync at Microsoft, and a German-speaking colleague demoed the app at the Code Conference, in Rancho Palos Verdes, Calif. As Pall spoke in English, both German and English subtitles scrolled along the bottom of the screen while real-time audio translation accompanied the subtitles.
The software system is a synthesis of several technologies, including speech recognition, machine translation, and speech synthesis. But Vikram Dendi, technical and strategy advisor at Microsoft Research, in Redmond, Wash., says past attempts to simply daisy-chain the technologies were unsuccessful because developers had failed to consider the drastic difference between the way we speak and the way we write.
For starters, real speech is peppered with vocalized “ums” and “ahs,” awkward pauses, varying intonations, and vocal stresses, which are all absent in text. Consider what would happen if a speech translation system misinterpreted the subtle difference between these two statements:
“You’re picking up the kids?”
And “You’re picking up the kids!”
Suffice it to say, grumpy offspring would be the end product.
The gap between translating text and translating speech exists because the best machine-translation systems today are trained on large volumes of polished, high-quality text, which lacks the disfluencies that speech recognition systems must contend with. So Microsoft Research set about searching for techniques to help close that gap. Among them was a software system the company developed to translate social media musings.
Before turning to social media, Microsoft’s translation system extracted text from published books and Web sources that had been translated from one language to another. The data was then fed into a machine-learning pipeline that Microsoft calls phrasal statistical machine translation (phrasal SMT). The system chops the text into short word sequences called n-grams, where n denotes the number of words in each sequence. If the system is trying to translate, say, English to German, then each n-gram from a text in English is mapped to the corresponding n-gram of the equivalent text in German. This process teaches the computer what each phrase translates to.
Once it has learned its fill from the n-gram alignment, the software is ready to encounter new, untranslated text. When the machine is asked to translate a new phrase in English, the algorithm calculates the probability that the new English segment of text maps to one of the phrases it knows in German. The system then spits out the most probable translation.
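The phrase-matching step described above can be sketched in a few lines. The phrase table and its probabilities below are invented for illustration; a real phrasal SMT system learns millions of weighted entries from aligned bilingual corpora and combines them with a language model, which this toy omits.

```python
# Toy phrasal SMT: a learned phrase table maps English phrases to
# candidate German translations with probabilities. Translation picks
# the longest known phrase at each position and its most likely mapping.

# Hypothetical phrase table: English phrase -> [(German phrase, probability)]
PHRASE_TABLE = {
    "good morning": [("guten Morgen", 0.9), ("schoenen Morgen", 0.1)],
    "thank you": [("danke", 0.7), ("vielen Dank", 0.3)],
    "very much": [("sehr", 0.8), ("vielmals", 0.2)],
}

def translate(sentence: str) -> str:
    """Greedily match the longest known phrase at each position
    and emit its most probable translation."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, down to a single word.
        for end in range(len(words), i, -1):
            phrase = " ".join(words[i:end])
            if phrase in PHRASE_TABLE:
                best = max(PHRASE_TABLE[phrase], key=lambda pair: pair[1])
                output.append(best[0])
                i = end
                break
        else:
            output.append(words[i])  # unknown word passes through unchanged
            i += 1
    return " ".join(output)

print(translate("thank you very much"))  # danke sehr
```

Because the table stores whole phrases rather than grammar, this approach excels exactly where the article says it does: memorizing and matching.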
Phrasal SMT excels at memorizing and matching data. It can translate a common phrase across several languages, and even if the words in the phrase are slightly reordered, it still works. But if the words in an uncommon phrase are reordered, the system gets confused. Some of the confusion arises because SMT doesn’t really understand grammar and so can’t shift from the rules of one language to those of another. For example, an English sentence usually runs subject, verb, object. But the same sentence in Japanese would be subject, object, verb.
This is why the Microsoft Research team pioneered a system known as syntactically informed phrasal statistical machine translation (syntactic SMT). It builds on the phrasal SMT foundation but also understands syntax. Instead of just matching common phrases, syntactic SMT breaks up a phrase into individual words and then maps each word over to the other language.
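The word-order problem that syntax awareness solves can be shown with a toy English-to-Japanese example. The part-of-speech tags and the tiny lexicon below are hypothetical hand labels; a real syntactic SMT system derives this structure from a learned parser, not a fixed dictionary.

```python
# A minimal sketch of syntax-aware reordering: English runs
# subject-verb-object, Japanese subject-object-verb, so words must
# be rearranged before word-by-word mapping.

# Each English word hand-tagged with its grammatical role (hypothetical).
TAGGED = [("I", "subject"), ("eat", "verb"), ("sushi", "object")]

# Toy English -> romanized Japanese lexicon, invented for illustration.
LEXICON = {"I": "watashi wa", "eat": "tabemasu", "sushi": "sushi o"}

def reorder_svo_to_sov(tagged):
    """Reorder subject-verb-object input into subject-object-verb,
    then map each word into the target language."""
    order = {"subject": 0, "object": 1, "verb": 2}
    reordered = sorted(tagged, key=lambda pair: order[pair[1]])
    return " ".join(LEXICON[word] for word, _ in reordered)

print(reorder_svo_to_sov(TAGGED))  # watashi wa sushi o tabemasu
```

A phrase-only system that had never seen this exact sentence would have no rule telling it the verb must move to the end; encoding the grammatical roles makes the reordering systematic.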
Cutting up phrases and connecting individual words may sound like a primitive approach, but it’s not. “That’s pretty much the best method,” says Chris Manning, professor of linguistics and computer science at Stanford. “Microsoft’s machine translation team has been one of the prominent developers in this area, and basically, that is the state of the art in machine translation at the moment.”
Syntactic SMT was a big step, but there was room for improvement, particularly in the fast-growing universe of social media. The Microsoft Research team began studying communications on Facebook, by Short Message Service (SMS), and on Twitter to figure out the best way to manage conversational text.
But that came with a new set of problems. Each social media platform has its own distinct characteristics—Facebook posts incorporate more emotional expressions, SMS users type shorter messages, and tweets are something in between. So researchers had to first develop a social media text normalization system, software that could automatically adapt to these variations in style to produce something that syntactic SMT can process. Adding the normalizer system to the translator’s training protocol helped increase the accuracy of social-media text translation by 6 percent, according to Microsoft’s Dendi. “That significantly improved the quality,” he says. “Of course, there’s still a lot of work to do, but when we did this, it really did move the needle on understanding and translating that type of data better.” What’s more, the techniques developed to improve social media translation are very similar to what was needed to bridge the gap between speech recognition and translation.
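The normalization idea can be sketched as a simple pre-processing pass. The slang table and the letter-collapsing rule below are invented stand-ins; Microsoft's actual normalizer is learned from data and adapts per platform rather than relying on a fixed dictionary.

```python
# A toy social-media text normalizer: rewrite abbreviations and
# exaggerated spellings into standard text before translation.
import re

# Hypothetical slang dictionary, invented for illustration.
SLANG = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "2nite": "tonight",
    "thx": "thanks",
}

def normalize(text: str) -> str:
    """Collapse runs of three or more repeated letters (sooooo -> so),
    then replace known slang tokens with standard words."""
    normalized = []
    for token in text.lower().split():
        token = re.sub(r"(.)\1{2,}", r"\1", token)  # sooooo -> so
        normalized.append(SLANG.get(token, token))
    return " ".join(normalized)

print(normalize("thx u r sooooo gr8"))  # thanks you are so great
```

Feeding the cleaned-up output into a translator trained on standard text is the same move the article describes: adapt the messy input to what syntactic SMT already knows how to process.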
Skype Translator isn’t the only speech translation system on the scene, though. According to Macduff Hughes, engineering director of Google Translate, many people use his company’s software to test their own ability to speak a foreign language. He also says that in the past year, Google has added new features on its mobile apps that allow people to use Translate in more scenarios. But the system doesn’t yet translate in real time and is not integrated into a video telephony application, which means multilingual speakers need to be in the same location and speak into the same app.
Google might be one of the only other companies with a shot at making a comparable system. Dendi says Microsoft’s Skype work required deep knowledge of the company’s Bing Web index to build the translation system, and a company would need similar assets to build another. “That’s why there are only a few places in the world that can build a system of this kind and scale that can serve millions and millions of customers in this fashion across a range of scenarios,” Dendi says.
This article originally appeared in print as “How Social Media Teaches Skype to Speak.”
Theresa Chong is a video host and multimedia technology journalist based in Palo Alto, Calif. As on-camera talent, she has performed science experiments for “Discovery News,” explained how virtual reality works for USA Today, and interviewed Adam Savage for IEEE Spectrum. She has written about wearables for Scientific American and travel tech for Architectural Digest. With a DSLR, GoPro, and green screen by her side, she has produced digital videos of robots, driverless cars, and 3D printing. She earned a master’s degree from Northwestern University’s Medill School of Journalism, and in a prior life she worked as a civil engineer.