Winner: A Fountain of Knowledge
2004 will be the year of the analysis engine
The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don't have a clue as to what any of that data actually means.
Computer scientists have been laboring for decades to eliminate that weakness, with some limited successes in some limited domains. Now, IBM Corp. appears to have made a major breakthrough in the field of machine understanding. The results could spell big business not just for IBM but for data miners, content providers, retailers, political consultants, market analysts, and any other group that relies on information as part of its stock in trade.
Right now, the standard tools for dealing with the glut of online information are search engines, among which Google reigns supreme. But the goal of these search engines is to direct a user to a small selection of documents that simply contain words that match a set of search terms. Even though millions of documents may be returned, the search engines try to sort the results so that only a handful of the most relevant matches are presented first.
For example, imagine a marketing researcher trying to find out the online attitude of consumers toward the popular rock singer Pink. The researcher would have to wade through an ocean of search results to sort out which Web pages were talking about Pink, the person, rather than pink, the color.
What such a researcher needs is not another search engine, but something beyond that--an analysis engine that can sniff out its own clues about a document's meaning and then provide insight into what the search results mean in aggregate. And that's just what IBM is about to deliver. In a few months, in partnership with Factiva, a New York City online news company, it will launch the first commercial test of WebFountain with a service that will allow companies to keep track of their online reputation--what journalists are reporting about them, what people are writing about them in blogs, what people are saying about them in chat rooms--without having to employ an army of full-time Web surfers.
But how is WebFountain able to do that? Up to now this kind of aggregate analysis was possible only with so-called structured data, which is organized in such a way as to make its meaning clear. Originally, this required the data to be in some sort of rigidly organized database; if a field in a database is labeled "product color," there is little chance that an entry reading "pink" refers to a musician.
WebFountain works by converting the myriad ways information is presented online into a uniform, structured format that can then be analyzed. The goal is to provide a general-purpose platform that can allow any number of so-called analytic tools to sift the structured data for patterns and trends. Creating the needed structure automatically is WebFountain's big advance, because it requires at least some understanding of what the information actually means.
A key element in WebFountain's ability to understand documents is to employ a more flexible approach to structuring data than putting it in a database. One method popular in recent years is adding text labels to documents that tell a computer what the various elements of the content mean using the eXtensible Markup Language (XML). Things such as price or product identification numbers are identified by bracketing them with so-called tags, as in <product-name> Deluxe Toaster </product-name>, <price> $19.95 </price>.
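As a rough illustration of why tagged content is easy for software to consume, the sketch below reads a small catalog fragment with Python's standard xml.etree.ElementTree module. The <catalog> and <item> wrapper elements and the read_catalog function are assumptions added to make the fragment well-formed and runnable; only the product-name and price tags come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragment; <catalog> and <item> are illustrative wrappers.
CATALOG_XML = """
<catalog>
  <item>
    <product-name> Deluxe Toaster </product-name>
    <price> $19.95 </price>
  </item>
</catalog>
"""

def read_catalog(xml_text):
    """Return a list of (product name, price in dollars) pairs."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        name = item.findtext("product-name").strip()
        # Drop the surrounding spaces and the leading dollar sign.
        price = float(item.findtext("price").strip().lstrip("$"))
        items.append((name, price))
    return items

print(read_catalog(CATALOG_XML))
```

Because the tags state what each value means, the program never has to guess which string is a name and which is a price.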
The primary advantage over the database approach is that documents are more easily shared. A company can post its XML-tagged catalog on its Web site, and potential customers can then scan the catalog with software that automatically compares prices and other details with the similarly tagged catalogs of other companies.
But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms. XML requires that the elements of a document be broken down into clear, unambiguous categories. E-mails or instant messages can't be labeled in this way without destroying the ease of use that is the hallmark of these ad hoc communications; who would bother to add XML labels to a quick e-mail to a colleague?
Goal: Create a system that can convert the anarchy of online data, Web pages, e-mail, chat rooms, and more into a format that can be analyzed to identify commercially valuable information
Why it's a Winner: By leveraging the vast reserves of untapped information online, WebFountain will let companies make smarter business decisions while creating an open platform that will encourage more machine understanding research to leave the laboratory
Organization: IBM Corp.
Center of Activity: IBM Almaden Research Center, San Jose, Calif.
Number of People on the Project: 120
Budget: Over US $100 million
Even for those sources where it is appropriate, few outside the world of professionally produced documents have the inclination or the tools to use XML. As a result, most digital information is in the form of unstructured data--the splendid jumble of the Web or the messages in your e-mail inbox--and that is unlikely to change any time soon.
WebFountain solves this problem with a piece of alchemy: it transforms unlabeled data into XML-labeled data. The technology behind this transmogrification from unstructured to structured data is at the heart of the project [see diagram, "Alchemy in Action"].
Unstructured data from the Internet enters the WebFountain system in much the same way that search engines gather data. A program, known as a spider or a crawler, wanders the Web, storing the text of each page it comes across in a database and following any links it finds to new pages. (In practice, several spiders running on separate computers crawl the Web simultaneously, feeding their findings into a common pool.)
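The crawl loop described above can be sketched in a few lines. This is a minimal, single-process illustration, not WebFountain's crawler: the TOY_WEB dictionary is a hypothetical stand-in for real network fetches, and the crawl function and its fetch parameter are names invented for this example.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("Welcome", ["a.example/news", "b.example/blog"]),
    "a.example/news": ("Headlines", ["a.example/home"]),
    "b.example/blog": ("My trip", ["c.example/missing"]),
}

def crawl(seed, fetch):
    """Breadth-first crawl from seed; returns {url: text} for pages fetched."""
    store = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # unreachable page: skip it
            continue
        text, links = page
        store[url] = text  # keep the page text in the common pool
        for link in links:
            if link not in seen:  # follow each link only once
                seen.add(link)
                queue.append(link)
    return store

pages = crawl("a.example/home", TOY_WEB.get)
print(sorted(pages))
```

A production crawler would add politeness delays, revisit schedules, and many workers sharing the queue, none of which is modeled here.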
Although the pooled data is compressed to about one-third its original size to reduce storage demands, WebFountain still requires a whopping 160-plus terabytes of disk space. It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.
To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4 to 5 terabytes of disk storage.
The main cluster is where the process of digesting raw documents into structured data begins. Extraneous content, such as the standard navigation links common to all the entries on a community blog site, is stripped away. Then a set of up to 40 software programs dubbed annotators begin converting documents into a standardized form that can be analyzed. Each annotator scans the document looking for words and phrases it recognizes. When it does recognize something, it inserts an identifying XML tag into the document.
Different annotators specialize in looking for different things. For example, one annotator specializes in spotting geographic information. If it comes across the words "Mount Fuji," it will add a tag that identifies the words as a geographic reference and might add the longitude and latitude of Mount Fuji. In this way, a sentence that originally read "We visited Mount Fuji and took some photos" would become something like "We visited <geo-ref long=138E lat=35N>Mount Fuji</geo-ref> and took some photos."
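To make the idea concrete, here is a minimal sketch of an annotator pipeline, assuming a tiny hypothetical gazetteer in place of a real geographic knowledge base. The names GAZETTEER, geo_annotator, and annotate are invented for this illustration, and the attribute values are quoted because well-formed XML requires it.

```python
# Hypothetical gazetteer standing in for a geographic knowledge base.
# Coordinates are the rounded values used in the example above.
GAZETTEER = {"Mount Fuji": ("138E", "35N")}

def geo_annotator(text):
    """Wrap each known place name in a geo-ref tag with its coordinates."""
    for place, (lon, lat) in GAZETTEER.items():
        tag = f'<geo-ref long="{lon}" lat="{lat}">{place}</geo-ref>'
        text = text.replace(place, tag)
    return text

def annotate(text, annotators):
    """Run each annotator in turn over the text."""
    for annotator in annotators:
        text = annotator(text)
    return text

print(annotate("We visited Mount Fuji and took some photos", [geo_annotator]))
```

Each additional annotator would be one more function in the list passed to annotate, which is one plausible way to let specialized recognizers build on one another's output.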
By the time the annotators have finished annotating a document, it can be up to 10 times longer than the original. These heavily annotated pages are not intended for human eyes; rather, they provide material that the analytic tools can get their teeth into. For example, a data-mining program looking for tourists who have visited Japan would be able to search for geographic references that fall inside that country's borders.
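A data-mining query of that kind might look something like the sketch below. It assumes the annotations carry numeric coordinates in decimal degrees, wraps each document in a hypothetical <doc> element, and approximates Japan with a crude bounding box; the coordinates shown are approximate and all names are illustrative.

```python
import xml.etree.ElementTree as ET

# Rough bounding box for Japan in decimal degrees; an assumption for
# illustration only, since real borders are not rectangular.
JAPAN_BOX = {"lat": (24.0, 46.0), "lon": (123.0, 146.0)}

# Hypothetical annotated documents with approximate numeric coordinates.
DOCS = {
    "trip1": '<doc>We visited <geo-ref long="138.7" lat="35.4">Mount Fuji</geo-ref>.</doc>',
    "trip2": '<doc>Lunch in <geo-ref long="-6.3" lat="53.3">Dublin</geo-ref>.</doc>',
}

def mentions_region(xml_text, box):
    """True if any geo-ref in the document falls inside the bounding box."""
    root = ET.fromstring(xml_text)
    for ref in root.iter("geo-ref"):
        lat = float(ref.get("lat"))
        lon = float(ref.get("long"))
        in_lat = box["lat"][0] <= lat <= box["lat"][1]
        in_lon = box["lon"][0] <= lon <= box["lon"][1]
        if in_lat and in_lon:
            return True
    return False

hits = [name for name, xml_text in DOCS.items() if mentions_region(xml_text, JAPAN_BOX)]
print(hits)
```

The query never looks at the words "Mount Fuji" at all; it works entirely from the structure the annotator added.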
The annotation process is what creates a structured store of data that can be analyzed. Some of the annotators are relatively simple and identify such things as links to other documents or the language and character set used. Others are much more complex and begin the tricky process of inferring the meaning behind the words in a document; these annotators rely heavily on a set of so-called knowledge bases that contain information about the real world.
Some of these knowledge bases are publicly available, such as the U.S. Securities and Exchange Commission's database of publicly traded companies or Stanford University's TAP database, a semantic Web platform that tries to capture common-sense knowledge about the world and the relationships among the things in it. Other databases are proprietary to IBM or its partners and describe information about particular knowledge domains--for example, annotators that understand terms used in the petrochemical or pharmaceutical industries.
Perhaps the toughest task in the hunt for meaning performed by the annotators is disambiguation--choosing the intended meaning from a range of possibilities. This might occur, say, in a situation when the geographic annotator comes across the word "Dublin." Is this a reference to Dublin, Ireland, or Dublin, Ohio?
The annotator must fall back on a combination of the information contained in the knowledge base and the examination of any other geographical terms identified in the document. If it spots references to "Ireland" or "Galway," the Dublin meant is probably the capital city of the Irish republic, while mentions of "Ohio" or "Columbus" indicate that the home of the Wendy's fast food empire is intended instead. If all else fails, the larger population of the Irish city versus the one in Ohio will tilt the odds in its favor.
This approach can be tripped up, however--for example, when the geographic annotator comes across a blog entry written by an Irish Dubliner about his vacation to New York City. The U.S. geographic references will tend to swing the pendulum in favor of interpreting Dublin as being in Ohio.
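The heuristic and its failure mode can both be reproduced with a toy scorer. This is a sketch under stated assumptions, not IBM's algorithm: the COUNTRY_OF table is a hypothetical knowledge base, the population figures are rough illustrative values, and the rule (count supporting places in the same country, then break ties by population) is a simplification of the behavior described above.

```python
# Hypothetical knowledge base: country of each known place (illustrative).
COUNTRY_OF = {
    "Ireland": "IE", "Galway": "IE",
    "Ohio": "US", "Columbus": "US", "New York City": "US",
}

# Candidate meanings; populations are rough, illustrative figures.
CANDIDATES = {
    "Dublin": [
        {"name": "Dublin, Ireland", "country": "IE", "population": 500_000},
        {"name": "Dublin, Ohio", "country": "US", "population": 40_000},
    ]
}

def disambiguate(term, other_places):
    """Pick the meaning best supported by other places in the document."""
    def score(candidate):
        support = sum(
            1 for place in other_places
            if COUNTRY_OF.get(place) == candidate["country"]
        )
        # Population only matters when the support counts are tied.
        return (support, candidate["population"])
    return max(CANDIDATES[term], key=score)["name"]

print(disambiguate("Dublin", ["Galway"]))
print(disambiguate("Dublin", ["Ohio", "Columbus"]))
print(disambiguate("Dublin", []))
print(disambiguate("Dublin", ["New York City"]))
```

The last call is the Irish blogger's vacation: a single U.S. reference is enough to outweigh the population prior, so the scorer picks Ohio.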
WebFountain's builders admit it's not always able to guess right, but they point out that humans can also be confused by ambiguous meanings. "If there's real confusion in the document, we promise to faithfully capture it," laughs Dan Gruhl, the analysis engine's chief architect, before leading me into one of WebFountain's server rooms.
Gruhl had to do his explaining before he led me into the server room because this kind of semantic analysis, it turns out, can be deafening. Imagine the roar of hundreds of tons of air conditioning along with the whirring of hundreds of computer cooling fans and disk drives, and you get an idea of what it's like to wander the WebFountain server rooms, where refrigeration keeps all the equipment from overheating.
When the first hardware for the current WebFountain architecture arrived in 1999, adequate air conditioning hadn't yet been installed, so Gruhl and his team brought in every fan they could find from their homes and used them to cool the server room. The team has gone through about seven incarnations of equipment since they started developing the analysis engine.
WebFountain's hardware is upgraded about every nine months, and it takes a small army of workers to deploy a new rack setup across the system. "The last time, I got a call from the person in charge of receiving," Gruhl shouted over the noise. "He wanted to know when somebody would be down to pick up the equipment that was filling his entire loading dock to a height of six feet."
Once documents have been annotated in the main cluster, another series of specialized machines goes to work, using clues such as how Web pages link to one another (similar to the technique Google uses to determine relevance rankings for its search results) to gain additional insight into the significance of a document. The documents are then handed off to another cluster that performs high-level analyses. Because the data has been converted from an unstructured format to a structured XML-based format, IBM and its partners can fall back on the data-mining experience and methodologies already developed for analyzing databases. The structured format also provides an easy target for developing new analytic tools.
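The best-known form of link analysis is PageRank, the published algorithm behind Google's link-based ranking. Whether WebFountain uses PageRank or some other method is not stated here, so the sketch below should be read only as an example of the general idea: a page gains significance when other significant pages link to it. The toy link graph, damping factor, and iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; links maps each page to the pages it links to.

    Every link target must itself appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank evenly.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page Web: three pages link to "c", which links to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```

In this toy graph the page with the most incoming links ends up with the highest score, and the page it links to comes second.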
WebFountain is not intended for casual surfers. Its target audience includes the business executives who have already shown they are willing to pay for the insights that mining corporate databases can supply. Analytic tools can ferret out patterns in, say, a sales receipt database, so that a retail store might see that people tend to buy certain products together and that offering a package deal would help sales. WebFountain will allow executives to go beyond their own databases and analyze up-to-date information from any online source.
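The sales-receipt example is a classic co-occurrence analysis. A minimal sketch, using a handful of invented receipts and a hypothetical pair_counts function, simply counts how often each pair of products lands in the same basket:

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts; each set is one customer's purchase.
RECEIPTS = [
    {"charcoal", "lighter fluid", "buns"},
    {"charcoal", "lighter fluid"},
    {"buns", "mustard"},
    {"charcoal", "lighter fluid", "mustard"},
]

def pair_counts(receipts):
    """Count how many receipts contain each unordered pair of products."""
    counts = Counter()
    for basket in receipts:
        # Sorting gives each pair one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

top_pair, times = pair_counts(RECEIPTS).most_common(1)[0]
print(top_pair, times)
```

Real data-mining tools go further, for example by weighing a pair's frequency against how often each item sells alone, but the raw counts already point to a candidate package deal.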
But the complexity involved in performing these analyses means that it will be a long time, if ever, before executives will be able to directly access WebFountain the way millions of users access Google daily. Rather, IBM intends to work with partners in different industries who can set up queries tailored to specific customers.
Once the queries have been set up, executives will be able to access the results via Web sites maintained by these partners. For example, once a week WebFountain could provide a report on gardening-related mailing lists and chat rooms. If a particular product is mentioned frequently, it might be worthwhile for a hardware chain to stock more units (or, if the buzz about a product is bad, to quietly retire it from the shelves).
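At its simplest, a weekly buzz report is a tally of product mentions. The sketch below assumes invented product names and messages and uses plain substring matching; it says nothing about whether the buzz is good or bad, which would need a separate sentiment step.

```python
from collections import Counter

# Hypothetical product names and one week of mailing-list messages.
PRODUCTS = ["TurboTrowel", "EverGreen Hose"]
POSTS = [
    "My TurboTrowel snapped after a week.",
    "Anyone tried the TurboTrowel? Thinking of buying one.",
    "The EverGreen Hose is fine, I guess.",
]

def weekly_mentions(posts, products):
    """Count the number of posts that mention each product."""
    counts = Counter()
    for post in posts:
        lowered = post.lower()
        for product in products:
            if product.lower() in lowered:  # case-insensitive match
                counts[product] += 1
    return counts

report = weekly_mentions(POSTS, PRODUCTS)
print(report.most_common())
```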
In fact, it is detecting just this kind of buzz that will be the foundation of the first commercial test of WebFountain. In the second quarter of this year, IBM's partner Factiva, which distributes Dow Jones and Reuters news content, will launch its WebFountain-based service that will track the online reputation of companies.
Factiva believes companies have very good reason to care enough about their online reputation to pay for this service, which will cost between US $150 000 and $300 000 a year. According to Denis Cahill, Factiva's associate vice president of architecture, people increasingly base their buying decisions on the reputation of the seller. "More and more, what people are buying is a company's reputation," he told IEEE Spectrum, pointing to recent research by corporate consultant Deon Binneman that indicates that as much as 40 percent of a company's market cap is directly linked to its reputation. And with the growth of online communities, it's easy for companies to unexpectedly find themselves the targets of negative public opinion. "Issues that used to take years to pop up now take months or weeks," says Cahill.
Factiva's service will allow companies to keep close tabs on their online reputations and address problems before unhappy consumers start turning up their noses--or worse, start organizing protests. Not only will customers get the benefit of being able to scan the Web at large, but Factiva's huge database of news articles will also be fed into WebFountain for analysis.
By forming their own alliances with IBM, other content providers, such as industry news organizations or technical publishers, could generate new revenue streams with WebFountain. By combining their current and archived proprietary material (like news articles or scientific papers) with material freely available online, they could offer to perform customized searches for the industries they cater to.
For example, a biomedical publisher could use WebFountain to search its articles and the Web for specific drugs on behalf of pharmaceutical companies. Currently, it is not unknown for pharmaceutical companies to start researching a potential new drug, only to discover later that it had already been tried and failed, with the sole evidence of its passing a simple footnote in a U.S. Food and Drug Administration report or a brief article in a minor pharmaceutical journal.
The information that is fed into WebFountain can also include internal corporate information, such as e-mails and reports, provided by the customers of IBM's WebFountain partners.
Bob Carlson, IBM's vice president for WebFountain, told Spectrum what this could mean in practice. The head of a research and development department could feed WebFountain all the e-mails, reports, PowerPoint presentations, and so on that her employees produced in the last six months. From this, WebFountain could give her a list of technologies that the department was paying attention to. She could then compare this list to the technologies in her sector that were creating a buzz online. Discrepancies between the two lists would be worth asking her managers about, allowing her to know whether or not the department was ahead of the market or falling dangerously behind.
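Once both lists exist, the comparison Carlson describes comes down to set arithmetic. In this sketch the technology names are invented, and compare_focus is a hypothetical helper rather than any part of WebFountain:

```python
def compare_focus(internal, external):
    """Split two topic lists into shared and one-sided topics."""
    internal, external = set(internal), set(external)
    return {
        "shared": sorted(internal & external),
        "only_internal": sorted(internal - external),
        "only_external": sorted(external - internal),
    }

# Hypothetical lists: what the department writes about vs. online buzz.
department = ["grid computing", "XML databases", "RFID"]
online_buzz = ["RFID", "Wi-Fi mesh", "grid computing"]

gaps = compare_focus(department, online_buzz)
print(gaps)
```

Topics that appear only in the online list are the ones the department may be overlooking; topics that appear only internally may be either ahead of the market or a dead end.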
The data miners and annotators, similar to the data itself, are a combination of publicly available and proprietary software. Beyond the low-level annotators that are used on all documents, which annotators and miners a document is passed through depends on the final customer's needs and can include miners specific to a partner.
Because new miners can easily be plugged into the existing architecture, business intelligence software developers can focus on, say, creating the best online sentiment analyzer around without also having to learn how to manage a dozen data formats or how to organize terabytes of data. Ultimately, boutique data miner programmers may be able to sell their analysis tools to either IBM or one or more of its partners, greatly reducing the barriers involved in commercializing such technology.
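WebFountain's actual plug-in interface is not described here, so the following is only a sketch of what a pluggable-miner design could look like: a registry maps names to functions, and each miner receives already-annotated documents without caring how they were gathered or stored. Every name in the block is hypothetical.

```python
# Registry of available miners, keyed by name.
MINERS = {}

def register_miner(name):
    """Decorator that adds a miner function to the registry."""
    def decorator(func):
        MINERS[name] = func
        return func
    return decorator

@register_miner("geo_ref_count")
def count_geo_refs(documents):
    """Total number of geographic annotations across all documents."""
    return sum(doc.count("<geo-ref") for doc in documents)

@register_miner("longest_doc")
def longest_document(documents):
    """The document with the most characters."""
    return max(documents, key=len)

def run_miners(documents, names):
    """Run only the miners a given customer has asked for."""
    return {name: MINERS[name](documents) for name in names}

ANNOTATED = [
    'We visited <geo-ref long="138E" lat="35N">Mount Fuji</geo-ref>.',
    "No places mentioned here.",
]
print(run_miners(ANNOTATED, ["geo_ref_count"]))
```

Adding a new analysis means writing one more decorated function; nothing else in the pipeline has to change.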
This, perhaps more than anything else, is why WebFountain looks like a winner. By creating an open commercial platform for content providers and data miners, it will foster rapid innovation and commercialization in the realm of machine understanding, currently dominated by isolated research projects. This would herald a sea change in our ability to use computers to generate insight and understanding that directly affect the bottom lines of businesses, something that is unfortunately all too rare with current IT systems.