A billion records and counting
Of the many announced reforms, probably the most provocative is the FBI's plan to engage in investigative data mining and data warehousing, with a view to detecting and connecting the traces of terrorist and criminal activity. Details are still sketchy, but presumably it would copy the techniques the commercial sector uses to track and predict consumer behavior, prevent IT network break-ins, and so on. (Repeated requests for interviews for this article were declined or went unanswered by bureau press officers; FBI contractors referred all questions to the bureau.)
"Data warehousing involves connecting various datasets from various sources--transactional data from your Web site, demographics data from providers like Axciom and Experian--and then using analytical software to detect patterns in the data, so that you can personalize the services you offer or detect fraud," explains data mining expert Jesus Mena.
There are two basic approaches, he says. "In the first, you look for outliers or deviations, things that are way outside normal behavior--somebody trying to access a computer network in the middle of the night, for example. The other is where you have a pattern of known activity and you have a signature that you try to match."
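The two approaches Mena describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the login hours, the event names, the z-score threshold of 2.0, and both function names are hypothetical and do not describe any system the FBI or a vendor actually uses.

```python
from statistics import mean, stdev

def find_outlier_hours(hours, threshold=2.0):
    """Approach 1: flag login hours that deviate strongly from the norm.

    Uses a plain z-score; the threshold is an assumed value, and the
    method ignores the fact that hours wrap around at midnight.
    """
    mu = mean(hours)
    sigma = stdev(hours)
    if sigma == 0:
        return []
    return [h for h in hours if abs(h - mu) / sigma > threshold]

def matches_signature(events, signature):
    """Approach 2: report whether a known pattern appears in the events.

    The signature must occur as a contiguous run of events.
    """
    n = len(signature)
    return any(events[i:i + n] == signature
               for i in range(len(events) - n + 1))

# Hypothetical data: one user's login hours, with a single 3 a.m. access.
login_hours = [9, 10, 9, 11, 10, 9, 10, 3]

# Hypothetical signature: three failed logins followed by a success.
BRUTE_FORCE = ["login_failed", "login_failed", "login_failed", "login_ok"]
session = ["page_view", "login_failed", "login_failed",
           "login_failed", "login_ok", "download"]

print(find_outlier_hours(login_hours))          # [3]
print(matches_signature(session, BRUTE_FORCE))  # True
```

The first function needs no prior knowledge of what an attack looks like, only a baseline of normal behavior; the second finds nothing unless someone has already written down the pattern.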
Mena's book Investigative Data Mining for Security and Criminal Detection (Digital Press, 2003) discusses how these commercially available techniques can be applied to law enforcement and intelligence. The FBI, he notes, has long been a customer of ChoicePoint (Atlanta, Ga.), which collects and sells consumer information. To detect criminal or terrorist behavior, he says, one would overlay that data with data from law enforcement (for example, arrest records, photographs, and fingerprints), immigration (visa records and border crossings), and intelligence (terrorist watchlists and the like).
At least in theory, this is exactly what the FBI needs in order to know what it knows. It has amassed criminal and intelligence-related data galore--over a billion records, by one bureau estimate, stored in many databases at dozens of sites. Only a fraction of the FBI's data is in a common format that can be easily searched, analyzed, and shared. The agency's $680 million Integrated Automated Fingerprint Identification System (IAFIS), for example, contains millions of digital fingerprint records and has cut search time from weeks to hours. But it is not directly linked to the FBI's main network for handling case files, which is a text-only system. What's more, many state and local agencies still lack the equipment to access IAFIS and upload and download prints. Nor, needless to say, is there a universal interface for allowing the databases at all the agencies to talk to one another, although some data exchange--between, for example, the FBI and the U.S. Immigration and Naturalization Service (now the Bureau of Citizenship and Immigration Services)--has begun since 9/11.
Reportedly, the bureau now maintains a production line of scanners and optical character recognition software to convert some 750 000 paper documents a day into electronic text. Key files relating to counterterrorism going back 10 years, some 40 million or so pages, have already been converted. Still, at the current rate, it will take more than three and one-half years to convert the rest. And more paper is being generated all the time.
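The figures above imply a rough size for the remaining paper backlog. The back-of-the-envelope sketch below multiplies the reported conversion rate by the reported time remaining. The number of scanning days per year is not given in the reporting, so both values tried here (365 and 250) are assumptions, and the function name is hypothetical.

```python
DOCS_PER_DAY = 750_000   # reported conversion rate
YEARS_REMAINING = 3.5    # reported lower bound on time to finish

def implied_backlog(docs_per_day, years, days_per_year):
    """Documents still to convert, assuming a constant daily rate."""
    return docs_per_day * years * days_per_year

# Assumption: scanners run every calendar day.
every_day = implied_backlog(DOCS_PER_DAY, YEARS_REMAINING, 365)
# Assumption: scanners run only on about 250 working days a year.
working_days = implied_backlog(DOCS_PER_DAY, YEARS_REMAINING, 250)

print(f"{every_day:,.0f}")     # 958,125,000
print(f"{working_days:,.0f}")  # 656,250,000
```

Either way, the stated rate and timeline point to a backlog of several hundred million documents at the least, and since new paper keeps arriving, the true finish date would be later than a fixed-backlog estimate suggests.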
Take the FBI's handling of last fall's sniper attacks in and around Washington, D.C. As described by William Hooton, assistant director of the FBI's records management division, in a 14 November speech to the Association for Information and Image Management (Washington, D.C.), the bureau set up a phone center to field tips from the general public. Staff members duly logged each call on paper forms, which were collected every hour and taken to FBI headquarters, where they were scanned and the digital images fed into a bureau-wide database.
All the same, as an article in Federal Computer Week pointed out, a scanned handwritten note is not an electronically searchable file. That may explain why the bureau did not discover until after the fact that eyewitnesses had reported spotting the suspects' car, including its New Jersey license plates, at a handful of the crime scenes.
"They're seen at one crime scene and then they're seen again at another one miles away. That's an incident in itself--why were they there?" observes Mena. "That's clearly a failure to connect the dots, to see a recurring pattern of sightings of a car with out-of-state plates."
So-called free forms, of the kind used in the sniper investigation, present an enormous obstacle for data analysis, Mena says. "Someone might describe an individual as being tall or having an accent or dressed a certain way, and different investigators will enter that information differently." The solution, he says, is "to standardize from the beginning, so that you use checklists, as opposed to free forms, to capture the data." Text-mining software from companies like Autonomy, HNC, and IBM could then be used to categorize and organize the raw data automatically.
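The difference between free forms and checklists shows up as soon as one tries to compare entries. In the sketch below, three hypothetical investigators record the same witness description first as free text and then by ticking fixed options. The field names, option lists, and sample text are assumptions chosen for illustration, not an actual FBI form.

```python
from dataclasses import dataclass
from enum import Enum

class Height(Enum):
    SHORT = "short"
    AVERAGE = "average"
    TALL = "tall"

class Accent(Enum):
    NONE_NOTED = "none noted"
    NOTED = "noted"

@dataclass(frozen=True)
class Description:
    """A checklist-style record: every field is one of a fixed set of options."""
    height: Height
    accent: Accent

# Free-form entries: same witness account, three different wordings.
free_form = [
    "tall guy w/ accent",
    "Very tall, spoke with an accent",
    "about 6'3, foreign-sounding",
]

# Checklist entries: the same account, captured with fixed options.
checklist = [
    Description(Height.TALL, Accent.NOTED),
    Description(Height.TALL, Accent.NOTED),
    Description(Height.TALL, Accent.NOTED),
]

print(len(set(free_form)))  # 3 distinct strings, no match found
print(len(set(checklist)))  # 1 distinct record, all three agree
```

The trade-off is that a checklist can only capture what its designers anticipated, which is presumably where text-mining software would still have a role for any remaining free-text notes.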
The FBI has not revealed whether or to what extent it has implemented such techniques. Last September, though, Mark Tanner, the FBI's information resources manager, told Government Executive magazine that he receives "probably 10 to 15 calls or e-mails a day from [vendors] who have solutions to these problems," but "we're unable to really implement them... because we don't have the infrastructure."
Standards matter all the more when information must be shared across agencies. The Department of Justice, which oversees the FBI, actually has a standards registry for just that purpose (see http://it.ojp.gov/jsr/public/index.jsp). It covers everything from message sets (IEEE 1512) to "the Interchange of Fingerprint, Facial, Scar Mark and Tattoo (SMT) Information." XML, the Extensible Markup Language, is one of the most widely discussed standards, in the FBI and elsewhere; there's now an XML standard for rap sheets and criminal histories. Taking that one step further, a group called the Organization for the Advancement of Structured Information Standards (OASIS) formed a technical committee in January to develop an XML framework for sharing criminal and terrorist evidence.
Filtering data in hopes of detecting a criminal or terrorist plot is not easy, Mena cautions. [The German federal police's recent exercise in data mining bore this out; see sidebar, "The German Solution."] Unlike consumers, terrorists are not prone to repetition. "So you have to anticipate new types of attacks--bombings, or bioterrorism, or other activities," he says. And all the data mining in the world will never trace the hand-written, hand-delivered messages that Osama bin Laden's Al Qaeda operatives allegedly use. "That's why a combination of human knowledge and machine learning is the best approach," Mena says.
Privacy and civil liberties advocates have a larger concern. "The FBI will now be conducting fishing expeditions using the services of the people who decide what catalogs to send you or what spam e-mail you will be interested in," says James Dempsey, executive director of the Center for Democracy and Technology (Washington, D.C.). "The problem is, the direct marketers can only call you during dinner time or mail you another credit card offer based on that information--the FBI can arrest you."
"We don't want to arrive at a situation where individuals are reluctant to, let's say, purchase a copy of the Koran from Amazon.com," agrees Steven Aftergood, a senior research analyst at the Federation of American Scientists (Washington, D.C.). "That would be intolerable." There need to be realistic error-correcting procedures, which in many cases do not now exist, not just for statistical or data-processing errors, but also for those introduced by "willful, deliberate abuse," he says. "The error-correction process should not be a knock on the door from the FBI."
Even with perfect data, data mining may yield a completely inaccurate picture. "It is all too easy to do Monday-morning quarterbacking and say 'Why didn't you connect the dots to see that stick of dynamite?' when in fact the same dots could be connected just as well to show a duck or a coffee mug," one computer expert with extensive training in intelligence work told IEEE Spectrum. Skeptical of both the technical capabilities and the political ramifications of the FBI's expanded surveillance efforts, he believes the technology will be "totally ineffective in its professed purpose [of catching terrorists] but too effective as a domestic police state tool."