Getting The Message

It ain’t just what you say, it’s the way that you say it

Illustration: Jonathan Barkat

Much of the sophisticated surveillance equipment that the United States used to win the Cold War is hardly tailored to the “war” on terrorism, say intelligence observers both inside and outside the government. For one thing, reconnaissance satellites developed to measure every cubic yard of concrete poured into Soviet airstrips, missile silos, bunkers, and radar posts are overkill for keeping tabs on a mobile enemy who may build perfectly adequate facilities out of scrap lumber and plastic sheeting. Similarly, undersea listening posts designed to catch stealth submarines may be useless against weapons that enter the United States by rustbucket freighter.

But sheer volume of data is an even bigger problem [see “Listening In”]. Intelligence aimed against “asymmetric warfare,” says Robert Popp, deputy director of the Defense Advanced Research Projects Agency’s (DARPA’s) Information Awareness Office, must be able to pick up on clues generated by the behavior of a few isolated conspirators scattered across the globe while ignoring the billions of conversations and other transactions conducted by innocent individuals. Intelligence analysts need ways of detecting and recognizing the significance of such unorthodox activities as the anomalous flight training that preceded the attacks against the World Trade Center and the Pentagon. Being able to spot known terrorists’ faces in airports, recognize their voices reliably on tapped phone lines, or discover what they are saying in a timely fashion would also be nice.

Even if it proves beyond the capability of current computers to translate conversations in or near real time, it should be possible to examine those conversations for other kinds of information—the emotional state of the speakers, for example. This automatic presorting, or filtering, can direct the efforts of human translators where they are likely to do the most good.

So how is the intelligence community going about pursuing its goals? Some idea may be gathered from the projects that DARPA and the National Science Foundation are funding at universities and corporate research centers. Lately, computer scientists at IBM; the University of California, Irvine; Cornell and Rutgers universities; SRI International; and a host of other institutions have found funding increased or redirected toward intelligence needs.

While Congressional critics of the pervasive surveillance entailed by DARPA’s Total Information Awareness program have threatened its funding, the agency’s project list still provides a good benchmark.

Mining for bits

Not surprisingly, data mining and other techniques for extracting coherent patterns of information from a flood of bits are near the top of the new research agenda. Multilanguage projects naturally get preference. Several of them hold out hope that computers will be able to correlate information in multiple languages without going to the trouble of machine translation.

A case in point: Kathleen McKeown and her colleagues at Columbia University (New York City) will be building on Newsblaster, a program that scans 60 news services from around the world in order to collate and summarize accounts of each day’s news. The team will extend Newsblaster’s classification methods to work with the less structured texts found in intercepted e-mail messages, chat sessions, and speech transcripts, and will also improve the system’s analytical tools. Accordingly, instead of summarizing an e-mail exchange, the extended software might simply list the points made by each participant, with links to the appropriate passages for analysts who want to look more deeply.

Barely as yet on McKeown’s to-do list is the optimization—and perhaps parallelization—of the classification algorithms, which currently take more than eight hours on a fast PC to generate each day’s summaries. Intelligence analysts work with much more data than Newsblaster—more, in fact, than any reasonable computer system can handle. Consequently, several of the other research projects funded by the NSF are aimed at algorithms for examining streaming data—looking at it just once as it goes by, to decide whether it’s worth saving and analyzing in detail.
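The look-once idea can be illustrated with a minimal Python sketch. It is not any funded project's algorithm: the watch-list terms, the scoring rule, and the threshold are all assumptions made up for the example. Each message is inspected a single time as it streams past, and only those that clear the threshold are kept for detailed analysis.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical watch list, chosen only for illustration.
WATCH_TERMS = {"flight", "training", "visa"}


def keyword_score(text: str) -> float:
    """Fraction of watch-list terms that appear in the message (assumed rule)."""
    words = set(text.lower().split())
    return len(words & WATCH_TERMS) / len(WATCH_TERMS)


def single_pass_filter(
    stream: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield items worth saving; every item is examined once and never stored."""
    for item in stream:
        if score(item) >= threshold:
            yield item


# An iterator can be consumed only once, which mimics data streaming by.
messages = iter([
    "lunch at noon",
    "flight training without landing practice",
    "renew my visa tomorrow",
    "visa expired before flight",
])

kept = list(single_pass_filter(messages, keyword_score, threshold=0.5))
print(kept)
```

Because the filter is a generator, memory use does not grow with the length of the stream; only the messages that pass are retained.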

Many of the tools that the intelligence community appears to be looking for won’t detect terrorist threats by themselves, says Fred Roberts, director of discrete mathematics and theoretical computer science at Rutgers University (New Brunswick, N.J.). Instead, they’ll recognize that something is going on, and pass the buck to a human being to determine exactly what that something is and what to do about it.

Such decision support systems are urgently needed at all levels of the intelligence community. One project cited by NSF program director Gary Strong would put secure PCs and software on the desks of government lawyers and judges so that the evidence and arguments for surveillance warrants may be presented without generating piles of paper.

At DARPA, Popp’s vision of the Total Information Awareness program includes a slew of projects to introduce collaborative software for what is called structured argumentation. This process allows intelligence analysts to make their assumptions explicit when working together, so that their colleagues may examine the evidence underlying particular assertions. The extra step streamlines the business of checking out multiple hypotheses about what adversaries might be doing. It also clarifies things when information is transferred up the command chain to the planners and policymakers who must ultimately frame the nation’s response to any threat.

That pesky U.S. Constitution

The rush to develop all of these complex data-mining and information-handling systems presumes, of course, that the data will be there to feed their algorithms once those algorithms are perfected. At the moment, a host of roadblocks stand in the way, everything from basic database-formatting incompatibilities to federal statutes and the U.S. Constitution. Roberts posits a continual scan of all antibiotic purchases at pharmacies around the country, which would help give early warning of a bioterrorist attack but open the data to misuse in an enormous number of ways.

So database developers must find ways—known in theory but seldom implemented—to extract needed information without turning the United States into a surveillance state. One possibility, according to Popp, would be to “anonymize” transaction records so that intelligence and law enforcement experts could put together relationships among suspects and clues pointing to illegal acts without finding out who their targets are until the level of evidence satisfies a judge or prosecutor [see illustration, “Separating Deed From Doer”].

Separating Deed From Doer: To get a jump on suspicious activities, DARPA's Total Information Awareness system would monitor and store the ordinary transactions (green box) of all citizens but keep separate the identities of those who made them (blue box). Transaction patterns suggestive of terrorist activity (someone overstays a visa, buys explosives) would be sorted by interest filters. But, to ensure privacy further, pointers to the location of the patterns and not the data itself would be stored in virtual data repositories. Only a court order could break down the privacy wall and match individuals to suspicious transactions. The data analysis tools would serve both to model real-world activities and to apply the models to improving the interest filters.

As envisioned by DARPA planners, the Total Information Awareness system would collate transactions from a tremendous range of sources. Banks, hospitals, airlines, and Web sites are the most obvious. Patterns that might be connected to terrorist activity or might help elucidate patterns of relationships among terrorist suspects and others would be winnowed out and stored in a virtual repository. Here, the repository’s virtual nature is key to avoiding both the personal privacy and national security debacles that could spring up when detailed dossiers on millions of people are stored in a single place.

Instead, the virtual repository would be capable of assembling the needed data when asked, and that would be done only when a legally binding determination has been made that the barrier between personal identification data and transaction records can be breached.
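One way to picture the separation of deed from doer is the following Python sketch. It is an illustration under stated assumptions, not DARPA's design: the keyed-hash tokens, the in-memory vault, the `court_order` flag, and the sample names are all hypothetical. Analysts see only tokens and can link transactions that share one; the real identity is released only when the legal condition is met.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key; it stands in for a secret held by a trusted custodian.
SECRET_KEY = b"demo-key-not-for-real-use"

identity_vault = {}    # token -> real identity (kept behind the privacy wall)
transaction_log = []   # (token, transaction) pairs visible to analysts


def pseudonym(identity: str) -> str:
    """Derive a stable token from an identity with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, identity.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def record(identity: str, transaction: str) -> None:
    """Store the transaction under a token; the name goes only into the vault."""
    token = pseudonym(identity)
    identity_vault[token] = identity
    transaction_log.append((token, transaction))


def reveal(token: str, court_order: bool = False) -> str:
    """Return the real identity only when a court order has been granted."""
    if not court_order:
        raise PermissionError("identity lookup requires a court order")
    return identity_vault[token]


record("Alice Example", "overstays visa")
record("Bob Example", "buys groceries")
record("Alice Example", "buys explosives")

# Analysts group transactions by token without learning who is behind them.
by_token = defaultdict(list)
for token, transaction in transaction_log:
    by_token[token].append(transaction)

# Hypothetical interest filter: both events attached to the same token.
PATTERN = {"overstays visa", "buys explosives"}
suspicious = [token for token, acts in by_token.items() if PATTERN <= set(acts)]
```

A real system would also have to guard against re-identification from the transaction details themselves, which this sketch does not attempt.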

In something of a Catch-22, researchers trying to develop new data-mining and pattern recognition techniques have no access to the kinds of information that their algorithms may ultimately be sifting. In other words, they don’t have good raw material with which to develop and test their algorithms. Several NSF and DARPA projects are aimed at filling that deficiency. Meanwhile, researchers use stopgaps: McKeown and company are running news-extraction algorithms on their own saved e-mail, while Roberts’s colleagues at Rutgers are honing their event-detection skills on Medline journal abstracts and Reuters news stories.

Elsewhere, both DARPA and NSF are funding research projects to build on current capabilities in speech recognition, including the keyword scanning reportedly deployed in the Echelon global telecommunications surveillance system. Also, DARPA’s HumanID program is intended to push the current state of the art in face-matching recognition and other methods of spotting people at a distance.

Speech recognition is already widely used in commercial applications such as specialized dictation and telephone voice menus; but it’s much harder to convert speech into text when subjects have no intention of getting their meaning across to a computer. Hence DARPA’s Effective Affordable Reusable Speech-to-Text (EARS) program, which could provide relatively inexpensive multilingual speech-to-text conversion a few years or more down the line.

Hearing between the lines

An NSF-sponsored project on “talk-printing” may give a sense of where the state of the art is going. Elizabeth Shriberg, Andreas Stolcke, and Kemal Sönmez of SRI International (Menlo Park, Calif.) are utilizing variations in pitch, rhythm, and speech volume—information that speech-recognition programs typically throw out—to refine word and sentence recognition, to identify speakers, and even to tell casual chats from serious discussions or the dissemination of orders and instructions.

Collectively, these variations in speaking style are known as prosody. They have traditionally been viewed as statistical noise that speech recognition programs must filter out while finding the best match between a series of 10- or 20-millisecond sound samples and a database of likely words or phonemes. But for the SRI group they are precisely what turns a string of sounds into information. Prosody can help analysts make sense of otherwise ambiguous transcriptions, says Stolcke, pointing out that conventional recognition tools would show no difference between “Don’t go!” and “Don’t! Go!”

Prosodic clues, says Shriberg, can also sort out relationships among speakers. Conversations among family members have different rhythms from those among colleagues, bosses, and subordinates, and both differ from conversations among strangers, business acquaintances, or enemies. This kind of information can be particularly important for intelligence organizations trying to put together a picture of hostile groups.

Shriberg goes further. By comparing suspect conversations to normal ones using the same words and looking for unusual inflections, it may even be possible to detect when people are using code words to discuss illicit business.

One of the current problems with applying the technology to intelligence, according to Shriberg, is a lack of good sample data. Name a good source of recorded conversations between wrongdoers, and it’s classified, under court seal, or protected by privacy laws. Asking people to simulate such conversations just yields recordings of the speech patterns actors think villains use.

Prosodic analysis requires only a fraction of the processing power of conventional speech recognition, which typically takes at least a second of processor time per second of speech, so it could help focus processing resources in settings where hundreds or thousands of hours of conversations need sifting every day. Simple applications of the technology, like detecting when the user of a voice menu system is getting frustrated, could come within a year; the most sophisticated ones, like detecting the different intonations and phrasing people use when they’re trying to keep a secret, might be five years away, she speculates.

As Shriberg explains, prosodic analysis typically uses a decision tree approach in which a series of fairly simple decisions combine to yield a rather sophisticated result. To determine whether a person is calm or angry, for instance, his or her utterances could be analyzed in accordance with the decision tree shown in the accompanying figure. The parameters for which the system looks are selected by experts in prosodic analysis so as to capture features that distinguish emotions and also lend themselves to extraction by automatic means.

To be effective, the features must be normalized for the individual speaker, the word content, and the channel.

Maximum normalized pitch, for example, captures how high the person’s intonation reaches within his or her own range. Higher pitch can be associated with more anger, although not always [see figure, “Zeroing In On Anger”].
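Speaker normalization can be made concrete with a small Python sketch. The formula here is an assumption chosen for illustration (a simple rescaling of the utterance's peak pitch into the speaker's own range), not necessarily the one the SRI group uses, and the pitch values are invented.

```python
def normalized_max_pitch(utterance_pitches_hz, speaker_min_hz, speaker_max_hz):
    """Place the utterance's peak pitch on a 0-to-1 scale of the speaker's range."""
    span = speaker_max_hz - speaker_min_hz
    if span <= 0:
        raise ValueError("speaker_max_hz must exceed speaker_min_hz")
    return (max(utterance_pitches_hz) - speaker_min_hz) / span


# The same 220-Hz peak means different things for different voices.
utterance = [150.0, 180.0, 220.0, 170.0]

low_voiced = normalized_max_pitch(utterance, speaker_min_hz=80.0, speaker_max_hz=240.0)
high_voiced = normalized_max_pitch(utterance, speaker_min_hz=160.0, speaker_max_hz=400.0)

print(low_voiced, high_voiced)
```

For the low-voiced speaker the peak sits near the top of the range, while for the high-voiced speaker it sits near the bottom, which is why the raw value alone would mislead a classifier.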

Zeroing In On Anger: Rip-roaring mad? Or cool as a cucumber? In sifting through billions of voice messages, analysts are more interested in angry than in calm speakers. A decision-tree classifier like this one is being tweaked to assess a person's state of mind. It classifies speech by parameters like pitch, vowel duration, and average energy (loudness), normalized for the speaker, word content, and channel. It also checks whether pitch falls steeply at the end of a sentence (red boxes). At each branch, the classifier assigns probabilities that someone is angry, or calm, in the light of empirical experience (yellow boxes). Presented with a high-pitched utterance, the classifier assigns a probability of 0.62 to anger, which is hardly definitive. So if a further test finds a high maximum vowel duration, a probability of 0.91 for anger is assigned, which is pretty conclusive. If the vowel duration is low, then the probability of anger is only 0.54, so a third test is applied. And so on.

The shape of the pitch contour also matters. Angry speech is rarely uttered as a question; it usually has a large falling intonation pattern at the end of the utterance (which gives it emphasis). Angry speech also tends to contain syllables or words that are more drawn out in time; thus features like the duration of the longest vowel (after normalizing for the identity of the vowel) are useful.

In addition, speakers tend to get louder when they are angry, so loudness (after normalization for the inherent energy in the voice and channel) is yet another source of information.

When a new utterance is presented to the classifier, its prosodic features are examined for their values, and the utterance is assigned probabilities of indicating anger versus calm, based on its path through the classifier. Presented with a new utterance that is high in pitch and high in maximum vowel duration, the classifier might assign the utterance a probability of 0.91 for anger. In contrast, an utterance high in pitch but with low vowel duration and no pitch fall might be assigned only a probability of 0.12 for anger. Utterances with low pitch and low loudness are very likely to indicate calm. And so on.
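The walk through the classifier can be sketched in Python using only the probabilities quoted in this article (0.62, 0.91, 0.54, and 0.12). Everything else is an assumption: the feature names are hypothetical, the features are reduced to yes/no answers, and branches for which no figure is given are left out, so the walk simply stops there rather than inventing a number.

```python
# Each node may carry the probability of anger reached so far ("p_anger")
# and a further test; the True/False keys lead to the child nodes.
TREE = {
    "test": "high_pitch",
    True: {
        "p_anger": 0.62,
        "test": "long_vowel",
        True: {"p_anger": 0.91},
        False: {
            "p_anger": 0.54,
            "test": "steep_final_fall",
            False: {"p_anger": 0.12},
            # The steep-fall branch is not quantified in the text.
        },
    },
    # The low-pitch branch is described only as "very likely calm".
}


def anger_path(features):
    """Return the anger probabilities met along the utterance's path."""
    node, path = TREE, []
    while True:
        if "p_anger" in node:
            path.append(node["p_anger"])
        test = node.get("test")
        if test is None:
            return path          # reached a leaf
        child = node.get(features[test])
        if child is None:
            return path          # branch not quantified in the article
        node = child


shouted = {"high_pitch": True, "long_vowel": True, "steep_final_fall": True}
clipped = {"high_pitch": True, "long_vowel": False, "steep_final_fall": False}

print(anger_path(shouted))
print(anger_path(clipped))
```

The last number in each path is the classifier's current estimate; an empty path means the sketch has no figure to report for that utterance.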

ID-ing faces—and other body parts

The goal of DARPA’s HumanID program is to develop technology for recognizing people at a distance. It faces tough challenges. The agency has organized a couple of controlled tests of commercial face-recognition systems, but none has demonstrated the kind of selectivity required for large public venues—airports, say, which may serve 50 000 nonterrorists a day.
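A little arithmetic shows why selectivity matters at that scale. In the Python sketch below, the 50 000 travelers a day comes from the airport example above; the false-alarm rates are purely hypothetical and are not results from the DARPA tests.

```python
def expected_false_alarms(travelers_per_day, false_alarm_rate):
    """Expected number of innocent travelers flagged in one day."""
    return travelers_per_day * false_alarm_rate


TRAVELERS_PER_DAY = 50_000  # figure used in the article's airport example

# Hypothetical false-alarm rates, chosen only to show the scale of the problem.
for rate in (0.01, 0.001):
    alarms = expected_false_alarms(TRAVELERS_PER_DAY, rate)
    print(f"false-alarm rate {rate:.1%}: about {alarms:.0f} false alarms a day")
```

Even a system that wrongly flags only one traveler in a thousand would, under these assumptions, send dozens of innocent people to secondary screening every day at a single airport.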

Program manager Jonathon Phillips holds out hope for newer techniques still in development, like gait recognition, which recognizes the hitches and rhythms characteristic of a person’s walk. It generally operates on silhouettes, using rudimentary image processing to extract moving outlines from a background. Another approach uses a small radar to detect the characteristic movements of the different parts of a pedestrian’s body.
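The silhouette step can be illustrated with a toy Python sketch of background subtraction, one common form of rudimentary image processing. It is an assumption that HumanID systems work this way in detail; the tiny grayscale frames and the threshold are invented for the example.

```python
def silhouette(frame, background, threshold=30):
    """Mark pixels (1) whose brightness differs from the background by more
    than the threshold; everything else (0) is treated as background."""
    return [
        [1 if abs(pixel - reference) > threshold else 0
         for pixel, reference in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]


# Toy 3-by-4 grayscale images (0 to 255); a bright figure crosses a dark scene.
background = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
frame = [
    [10, 200, 10, 10],
    [10, 200, 200, 10],
    [12, 200, 10, 10],
]

mask = silhouette(frame, background)
for row in mask:
    print(row)
```

Run on every frame of a video, masks like this trace the moving outline whose changing shape a gait recognizer would then analyze.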

Although gait recognition looks promising, research—much less development—is still at a fairly early stage. DARPA’s plans for the coming year include building a database of video footage with people walking so that researchers will have reproducible standards to test their systems against. Even with rapid progress, it will be years before gait recognition reaches the field.

Where do all these R&D efforts leave intelligence agencies? In those where investigators already have networked computers on their desks, research results that augment existing data-mining and analysis software could bear fruit quite quickly. Elsewhere, high-tech antiterrorist tools appear to be at the beginning of a lengthy development and procurement cycle, comparable to that of major weapons, like tanks and aircraft.

To Probe Further

For the Defense Advanced Research Projects Agency’s official description of its Total Information Awareness program, go to http://www.darpa.mil/iao/TIASystems.htm.

To learn about the European Parliament’s inquiry into the Echelon eavesdropping network, visit the committee’s Web site at http://www.europarl.eu.int/committees/echelon_home.htm.
