Talk to the Machine
With better chips and faster algorithms, device makers are putting voice interfaces in PDAs, cellphones, and cars
This is part of IEEE Spectrum's special R&D report: They Might Be Giants: Seeds of a Tech Turnaround.
Max Huang says he has something cool to show me. I'm skeptical: he's holding in his hand what looks like a PDA. It is a PDA, a Compaq 3600, to be exact, unadorned and, to my eyes, unremarkable. What's special is what's inside: this PDA understands what you say.
Huang and his colleagues at the Philips Speech Processing office in Taipei, Taiwan, have streamlined the company's standard speech recognition engine, meant for servers and PCs, to run instead on a PDA. It's just a prototype, Huang says, but the Mandarin-language recognizer can distinguish about 40 000 words and still not tax the Compaq's memory, power, or processing. With it, Huang can access his address book, schedule appointments, and dictate e-mail. Considering the alternative--poking away at the device's tiny display with a skinny stylus--I'm starting to be convinced: this does seem pretty cool.
To the extent that the average person is familiar with speech recognition, she probably thinks of dictating reports to a PC, or maybe dialing an automated call center for flight or train schedules. Indeed, the speech industry has been pushing those kinds of applications over the last decade.
But some of the most novel and most challenging work being done now involves putting speech recognition where it was previously thought infeasible: into toys and MP3 players, car navigation and entertainment systems, and cellphones and PDAs. What's enabling the migration of speech to smaller devices is, on the one hand, efficient speech recognition engines that can handle noise and variations in speech, and, on the other, faster, bigger, and cheaper processors and memory chips on which the engines can run.
The push for embedded speech comes at a time when manufacturers are trying to cram ever more functions into ever smaller devices. "There's just not enough room for all the buttons and displays," says Erik Soule, director of marketing for Sensory Inc. (Santa Clara, Calif.), a developer of embedded speech products. A voice interface that lets you say the name of that Beatles song you want to listen to, rather than delving through your iPod's multiple menus, offers a less frustrating alternative. "We look at voice as a great complement to the visual and touch user interfaces," Soule says.
Will consumers buy it? The Kelsey Group (Princeton, N.J.), one of the few analyst firms that track embedded speech, thinks so. In a white paper issued in July, Kelsey projected that software licenses from embedded speech will grow from US $8 million this year to $277 million in 2006, making it one of the fastest-growing segments of the speech market. That said, speech is not a business where good products translate into easy profits: witness the 2001 collapse of Lernout and Hauspie, until then an industry leader and holder of some of the best technology around.
Still, a wide range of little and big companies are now getting into the embedded speech market. This includes established players, like IBM and Philips [ranked (5) and (24), respectively, among the Top 100 R&D Spenders], both of which have higher-end speech recognition products and decades of research experience. It also includes smaller firms like Sensory, Advanced Recognition Technology, and Voice Signal Technologies, which focus on embedded technology [see chart, Who's Getting Into Embedded Speech].
They're betting on a wide range of applications. A few, like voice dialing, have already entered the mainstream, while others, like voice-activated light switches and TV sets, remain a novelty, and still others, like composing e-mail on your cellphone and retrieving directions while driving, lie farther out on the technological horizon.
Under the hood
In a sense, this is old wine in new bottles. The basics of today's speech recognizers were first worked out in the early 1970s by researchers at IBM Corp. and Carnegie Mellon University. Since then, assorted companies and university groups have made incremental advances in the science and technology. It's a truly interdisciplinary field, cutting across computer science, applied math, electrical engineering, linguistics, and cognitive science.
Commercial developers of speech recognition engines tend to be tight-lipped about how their systems work. But speech recognition is actually quite similar across all engines, whether for PC dictation or composing e-mail on a cellphone.
When a person speaks, a microphone converts the sound waves into an analog signal, which is then digitized and sliced up into frames of, typically, 10 or 20 ms. (Each frame is short enough so that its spectral properties are relatively fixed and long enough to capture at least one pitch period.) The engine then extracts from each frame the spectral features it needs and throws the rest away.
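The slicing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the frames do not overlap, and average energy stands in for the spectral features a real engine would extract.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Slice digitized samples into consecutive, equal-length frames.

    Assumption for illustration: frames do not overlap, and any
    trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def frame_energy(frame):
    """Stand-in feature: the average squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)
```

At an 8-kHz sampling rate, a 10-ms frame holds 80 samples, so a tenth of a second of audio yields 10 frames, each reduced here to a single number.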
Recognizing speech involves matching what is said to a library of known utterances. In theory, you could compare your speech sample to some sort of acoustic database of every known word, spoken in every accent, in every setting, and so on. While you might eventually find the perfect match, the search time alone would rule out real-time applications.
Instead, speech recognition engines settle for what is most likely to have been said. First, the engine compares the spoken sample to its acoustical database of phonemes--the basic sounds that make up a language--and their variants, called allophones. (The "d" sound in "dog," for example, differs subtly from the "d" in "and" or "address.") English has about 40 phonemes and several thousand allophones. The recognizer's lexicon tells it how these phonemes and allophones combine to form words--that is, how words are pronounced.
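A lexicon of this kind is, at bottom, a pronunciation dictionary. The sketch below is a toy version: the three entries, the phoneme labels, and the function names are all invented for illustration, and a commercial lexicon would also cover allophones and hold thousands of words.

```python
# Hypothetical miniature lexicon: each word maps to one or more
# pronunciations, written as tuples of illustrative phoneme labels.
LEXICON = {
    "dog": [("D", "AO", "G")],
    "and": [("AE", "N", "D"), ("AH", "N", "D")],
    "add": [("AE", "D")],
}


def build_reverse_index(lexicon):
    """Map each phoneme sequence back to the words pronounced that way."""
    index = {}
    for word, pronunciations in lexicon.items():
        for phones in pronunciations:
            index.setdefault(phones, []).append(word)
    return index


def words_for(phones, index):
    """Return the words (sorted) whose pronunciation matches exactly."""
    return sorted(index.get(tuple(phones), []))
```

Looking up the sequence D, AO, G in the reverse index returns "dog"; a sequence that matches nothing returns an empty list.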
The speech engine may also rely on a language model, which tells how words are strung together into phrases or sentences. In the simpler variety, known as a grammar model, the speech engine only recognizes certain utterances in context. If the speaker is asked for her zip code, for example, the engine expects to hear a string of numbers. Grammar models work well for dialogs and command and control applications. But if you don't want the speaker to be quite so constrained, as in dictation, the language model instead relies on statistics about what words tend to occur together. If I say "Dow Jones Industrial," the likelihood of "Average" coming next is nearly 100 percent.
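The statistical variety can be approximated by counting which word follows each pair of words in a body of text. The following is a minimal sketch, not any vendor's method; the function names are hypothetical, and real language models add smoothing so that unseen word sequences do not get a probability of zero.

```python
from collections import Counter, defaultdict


def train_next_word_model(sentences, context_size=2):
    """Count how often each word follows each run of context_size words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts


def next_word_probability(counts, context, word):
    """Estimate P(word | context) as a simple relative frequency."""
    key = tuple(w.lower() for w in context)
    followers = counts.get(key, Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0  # context never seen in training
    return followers[word.lower()] / total
```

Trained on financial news, such a model would give "average" a probability close to 1 after "Jones Industrial," which is exactly the kind of hint the recognizer uses to rank its guesses.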
Before a speech engine starts to analyze a sample, there are many equally likely possibilities. Once a few phonemes have been recognized, the least likely possibilities--based on the acoustic and language models--start dropping off. Often, the engine can recognize a word midway through, because the probability is already so high. If, in the end, the most likely match is still poor, or if more than one match is found, the system may ask the speaker to repeat the utterance.
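One common way to implement this winnowing is to keep a running score for each candidate and discard any that fall too far behind the leader. The sketch below assumes log-probability scores; the function names and thresholds are hypothetical choices for illustration.

```python
def prune(hypotheses, beam_width):
    """Keep only hypotheses whose log score is within beam_width of the best.

    hypotheses maps a candidate word (or word sequence) to its
    running log-probability.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam_width}


def decide(hypotheses, min_score, min_margin):
    """Return the winning hypothesis, or None to signal 'please repeat'.

    None is returned when the best match is still poor (below min_score)
    or when the runner-up is too close to call (within min_margin).
    """
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None
    return ranked[0][0]
```

Calling prune after every few frames keeps the list of live candidates short, and decide captures the two cases in which the system would ask the speaker to try again.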
Nearly all the speech recognition engines on the market today are based on hidden Markov models (HMMs), which are used to represent how phonemes and allophones are pronounced and how fast they're spoken. Introduced three decades ago, the HMM's popularity shows no sign of waning. "The advantage of HMM is that mathematically it's extremely elegant, and it's easy to understand and implement," explains Victor Zue, director of the Massachusetts Institute of Technology's (MIT's) Laboratory for Computer Science.
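To give a flavor of that elegance, here is a minimal sketch of Viterbi decoding, the standard procedure for finding the most likely sequence of hidden states in an HMM. The article does not say which decoding method any particular engine uses, and the states, labels, and probabilities in the usage note are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability) for an HMM."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Trace the winning path backward from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]
```

With three states standing for the sounds of "dog," each allowed either to repeat or to move on to the next sound, the self-transition probabilities capture how long a sound lasts, which is how an HMM represents speaking rate. Production engines work with log probabilities to avoid numerical underflow on long utterances.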
There are other approaches, of course. Sensory and Fonix, another speech engine developer (Salt Lake City, Utah), offer neural network-based engines. Neural nets tend to be better than HMMs for picking out discrete words, but they require extensive training up front. As Sensory's Soule explains: "You tell it what words it should look for and what words it should not look for, and it keeps track of key attributes that are differentiators between the two types of words."
For recognizing continuous speech--whole sentences or strings of numbers--HMMs trump neural nets, which essentially do batch processing. Once the start and stop of a word are identified, "you send the speech off to the neural net recognizer to do its thing," Soule says. In fact, Sensory's high-end recognition engine, Fluent Speech, uses a hybrid approach of neural nets and HMMs.
HMMs also require training, to create the database of pronunciations against which the speech sample is compared. In some applications, the speaker may do the training, and the recognizer then becomes tuned to the speaker's particular style. For speaker-independent applications, the engine designers gather speech samples from many people, to create an averaged set of training data.
"If we're building a speech recognition engine for the car, we go and collect the speech sample in the car, to capture the acoustic characteristics of the chamber that the system's going to be used in," explains Fadi Kaake, general manager of voice control at the Philips Speech Processing office in Dallas, Texas. In fact, collecting good training data is just as important as writing an efficient search algorithm, and companies closely guard how they train their models.
Eventually, researchers hope to augment speech recognition with visual cues like lip-reading (think HAL in 2001: A Space Odyssey) or hand and arm gestures. Speech engine developers also continue to revisit language models, to embed the vocabulary with semantic relationships or include a syntactic structure that allows the engine to look at word occurrences beyond a two- or three-word span.
Getting into embedded
Given that speech recognition is so computationally intensive, how then do speech developers get their engines to run on a cellphone or a PDA?
For starters, they scale the applications to fit the device. At present, the most powerful embedded speech engines on the market--those running on a Compaq iPaq 3800, with its 200-MHz Intel StrongARM processor and 64MB of RAM, for example--can only recognize a couple thousand words, while voice-activated cellphones with 16-bit digital signal processors (DSPs) can handle perhaps a hundred words. (For comparison, IBM's ViaVoice PC dictation software comes with a vocabulary of about 150 000 words, which the user can expand.)
Limiting the vocabulary's size cuts down the searching that the engine needs to do and the memory it takes up. So while you can't dictate War and Peace onto your Palm Pilot, you can use voice commands to jump through menus or retrieve contacts from your address book.
Embedding speech has been helped enormously by the newest microprocessors and DSPs. Early voice-activated devices relied on dedicated ICs, in which the recognition engine was embedded. Increasingly, though, speech engine software uses the device's own processor and memory. Though these chips were not specifically designed for speech recognition, they might as well have been. The single-cycle multiply-accumulate calculations that DSPs excel at, for example, are ideal for speech search algorithms. Some of the processor/DSP chips that are hitting the market, like Texas Instruments' OMAP and Intel's ARM-based XScale, are perfect for speech applications on cellphone handsets, says Sensory's Soule. "You can run your application code on the processor side, and yet you've also got a DSP on the same piece of silicon to support speech algorithms and baseband processing."
At their simplest, voice-controlled devices recognize only a few basic commands--"Lamp, on," "CD player, track 8"--spoken in a certain order and at a certain speed. Everything else they hear is garbage. (It's like the Far Side cartoon in which a pet owner lectures the family dog, Ginger; what the dog hears is "Blah blah blah, Ginger, blah blah blah....") If you say the commands in the wrong order--"On, Lamp"--or too quickly, the device's limited recognizer won't understand.
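The rigidity of such a device is easy to see once the command set is written down as a lookup table. This sketch assumes the speech has already been turned into text; the table entries, action names, and function name are made up for illustration.

```python
import string

# Hypothetical fixed command set: the words must arrive in exactly this order.
COMMANDS = {
    ("lamp", "on"): "LAMP_ON",
    ("lamp", "off"): "LAMP_OFF",
    ("cd", "player", "track", "8"): "CD_TRACK_8",
}


def interpret(utterance):
    """Return the action for an exact command match, or None for 'garbage'."""
    cleaned = utterance.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return COMMANDS.get(tuple(cleaned.split()))
```

Here "Lamp, on" maps to an action, while "On, Lamp" returns None, because the table has no entry for that word order.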
For more complex tasks, you need a more forgiving interface, one that lets you speak like a human, not a robot, and doesn't require you to memorize menu structures or long lists of fixed commands. That is a much harder problem because the speech engine needs to not just recognize the speech, but understand it.
Philips' Max Huang shows how it's done. His prototype Mandarin-language speech recognizer runs on a Compaq iPaq 3600. The recognizer itself takes up only about 200KB of memory, and its models and tables take about 2MB; while it's recognizing, it uses an additional 1MB of RAM.
"Say I want to schedule an appointment with someone in my address book. I can, of course, type the information in, but I can also use voice," Huang explains. Typing means tapping on the miniature keyboard on the 2.3-by-3.0-inch display with a thin stylus. Entering Chinese is particularly tedious, because each character involves three or four successive taps.
Holding down the iPaq's microphone button, Huang says, in Mandarin, "Please find Y.C. Chu for me." The PDA obeys by producing the entry for Y.C. Chu (director of Philips' speech group in Taipei and also Huang's boss). Huang tries again, using a slight alteration: "Find Y.C. Chu." Again, the device brings up the correct entry.
In both cases, he explains, the speech engine first recognizes the sentence--converting it into text--and then extracts its meaning. In this application, each word in the vocabulary has been assigned an attribute--"find" is the command for looking someone up and "Y.C. Chu" is someone to be looked up. (For more straightforward dictation, where words don't need attributes, the PDA's vocabulary goes up to an impressive 40 000 words.) Huang doesn't have to speak to the PDA in any particular way, and, in fact, the PDA is not trained to Huang's, or anyone else's, voice--it's speaker independent.
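The second step, pulling meaning out of recognized text, can be sketched as a table of words and their attributes. This is a guess at the general idea rather than Philips' implementation: the table, the attribute names, and the function are hypothetical, and the example works on English text for readability.

```python
import re

# Hypothetical attribute table: phrase -> (attribute, value).
ATTRIBUTES = {
    "find": ("command", "LOOKUP"),
    "y.c. chu": ("contact", "Y.C. Chu"),
    "lifen yeh": ("contact", "Lifen Yeh"),
}


def extract_meaning(text):
    """Collect the attributes of every known phrase found in the text."""
    lowered = text.lower()
    meaning = {}
    for phrase, (attribute, value) in ATTRIBUTES.items():
        # Match whole words only, so "find" does not fire inside "finder".
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, lowered):
            meaning[attribute] = value
    return meaning
```

Both "Please find Y.C. Chu for me" and "Find Y.C. Chu" reduce to the same pair of attributes, a lookup command and a contact, which is why the two phrasings produce the same result.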
Huang goes next to the PDA's organizer, which allows appointments to be scheduled by filling in short forms. He first uses the simple command and control mode to fill in each field. "New appointment" pops open an empty form. "With Y.C. Chu" transcribes Y.C. Chu's name into the "who" field. "At the office" goes into the "where" field. And so on.
"Now I will encapsulate everything into one sentence," Huang says. " 'I want to see a movie with Lifen Yeh at 10 o'clock tomorrow.' " Within a second or two, the form has been filled out, except for the "where" field. "I didn't say where the movie is, so it left that blank," he explains. To confirm the appointment, Huang opens the e-mail program by simply saying "Send it to her." The recognizer understands that "her" means Lifen Yeh.
Could dictating short messages be the killer app for embedded speech? Something like a billion short message service (SMS) messages are now sent each day in Europe and Asia. This, despite the fact that "the interface for messaging on the cellphone is terrible," says Jordan Cohen, chief technology officer at Voice Signal Technologies (Woburn, Mass.), an embedded speech engine developer.
Dictation on a cellphone may seem like a stretch, but Cohen points out that SMS limits messages to just 160 characters, and the types of messages one might send on a cellphone tend to be limited. Voice Signal hopes to introduce message dictation by year's end, when a universally supported service is set to roll out in the United States.
MIT's Zue, who sits on the Voice Signal board, agrees that speech can be a more convenient interface. "It's natural, it's flexible, it's efficient," he says--but not for every setting. "Even if the system performs flawlessly, do you really want to always be talking to your machine?" Zue says. "With wireless technology, we can sit in a conference room and type to our laptops unobtrusively. But if you start talking to your machine, first of all it's obnoxious, and secondly, there's no privacy."
The art of noise
One of the most compelling places in which to embed speech is in the car. A speech interface promises safer driving through "hands-free, eyes-free" operation of the cellphone, dashboard controls, and navigation system.
But a car has lots of noise: the motor, wind, and road, as well as the CD player, radio, and other passengers. Thankfully, noise and speech frequencies travel differently, explains Chicago-based speech consultant Judith Markowitz. "You try to identify those frequencies that are moving differently from how you would expect speech to move," she says, "and then strip out a lot of that from the signal--a lot rather than everything, because some of those frequencies are also speech frequencies." Certain noises, like the car's engine at given speeds, can be measured ahead of time, which makes them easier to filter out later.
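A simple form of this idea is spectral subtraction: subtract a noise estimate measured ahead of time from each frequency band of the incoming signal, but never remove everything. The sketch below assumes the signal has already been converted into per-band magnitudes; the function name and the floor parameter are hypothetical, and the article does not say which filtering method any given product uses.

```python
def subtract_noise(frame_magnitudes, noise_profile, floor=0.1):
    """Subtract a pre-measured noise magnitude from each frequency band.

    floor is the fraction of the original magnitude that is always kept,
    since some of the 'noise' bands also carry speech.
    """
    if len(frame_magnitudes) != len(noise_profile):
        raise ValueError("frame and noise profile must have the same length")
    cleaned = []
    for magnitude, noise in zip(frame_magnitudes, noise_profile):
        reduced = magnitude - noise
        cleaned.append(max(reduced, floor * magnitude))
    return cleaned
```

The floor is what makes the filter strip out "a lot rather than everything": a band dominated by noise is turned down sharply but is never zeroed out or driven negative.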
The microphone's position is also key. Microphone arrays--with one or more mikes pointed at the speaker and others pointed away, to measure noise--are helpful but pricey. The extra microphones, analog-to-digital converters, DSPs, and signal processing involved can add a couple hundred to a couple thousand dollars to the cost.
Despite the technical challenges, several car telematics companies are making inroads. The OnStar system developed by General Motors (2) offers voice dialing and e-mail retrieval by voice. Meanwhile, DaimlerChrysler (4), which has offered a voice interface in its Mercedes-Benz line since 1996, has built a prototype system that lets the driver search for and make hotel reservations. And this month, Honda Motor Co. (21) debuted a voice interface in its Accord model; the speech engine is IBM's ViaVoice.
In five years' time, will we all be talking to our machines? As with any new technology, people's attitudes will have to evolve to the point where they need or want it. There's also a generational dynamic at work: younger people tend to be more accepting of voice-controlled gadgets. Then, too, millions of kids are growing up with voice-activated games and toys.
Speech applications of all sorts are also expected to get a boost from Microsoft Corp.'s (12) entry into the field. Although the Redmond, Wash., company doesn't sell embedded speech recognition engines, it has been pushing speech just about everywhere else--its Windows XP operating system includes a voice interface, and it is spearheading standards for voice tagging on the World Wide Web. Its speech research group, with over 100 workers in Redmond and Beijing, is second only to IBM's.
The embedding of speech recognition into everyday devices is seen as the next step on the road to pervasive computing, where computing and communications are available everywhere all the time. Different people have different visions of what this will be like. MIT's three-year-old Project Oxygen, for example, has created an experimental "intelligent room," equipped with microphones, video cameras, and motion detectors. The user's movements are continuously tracked, and he can interact with the space in whatever way he feels most comfortable: speech, gesture, even drawing.
"I think of the next frontier as one in which machines are really not a device that you program, but a partner in conversation--you talk to it, it understands you, and it will try to do things for you," says Zue. "We've been the slaves of our machines, interacting with them on their terms. We want to make machines more intelligent, rather than making humans more obedient."
To Probe Further
Daniel Jurafsky and James H. Martin's Speech and Language Processing (Prentice Hall, 2000).
Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon's Spoken Language Processing (Prentice Hall, 2001).
Judith Markowitz's Using Speech Recognition (Prentice Hall, 1996; reissued by J. Markowitz Consultants, 2000).