Intelligence about baseball had become equated in the public mind with the ability to recite arcane baseball stats. What [baseball statistician Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible. — Michael Lewis in Moneyball (2003)
Organizations of all sizes are sitting on mountains of data; what they really need are knowledge engineers who can excavate nuggets of valuable information from that data. Earlier this year (in "The Coming Data Deluge," IEEE Spectrum, February 2011), I mentioned the concept of data mining, which uses sophisticated software and database tools to extract nonobvious patterns, correlations, and useful information from large and complex data sets.
Data mining begins with data preprocessing: the gathering of the raw data, which is stored in a data warehouse or data mart. It continues with data cleansing, which removes unrelated or unnecessary data (called dirty data or noise) and looks for missing information.
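To make the cleansing step concrete, here is a minimal sketch in Python. It assumes the raw data arrives as a list of dictionaries; the function name `cleanse`, the field names, and the sample records are all hypothetical, chosen only for illustration. It drops fields that are irrelevant to the analysis, skips exact duplicates, and reports where values are missing.

```python
def cleanse(records, wanted_fields):
    """Return (clean_rows, missing_report) for a list of dict records."""
    seen = set()
    clean_rows = []
    missing_report = []  # (index into clean_rows, field name) pairs
    for record in records:
        # Keep only the fields relevant to the analysis; absent ones become None.
        row = {field: record.get(field) for field in wanted_fields}
        # Treat empty or whitespace-only strings as missing values.
        for field, value in row.items():
            if isinstance(value, str) and not value.strip():
                row[field] = None
        # Skip exact duplicates of a row that was already kept.
        key = tuple(row[field] for field in wanted_fields)
        if key in seen:
            continue
        seen.add(key)
        # Record which fields of this row still need attention.
        for field in wanted_fields:
            if row[field] is None:
                missing_report.append((len(clean_rows), field))
        clean_rows.append(row)
    return clean_rows, missing_report


# Hypothetical raw records, including noise, a duplicate, and gaps.
raw = [
    {"customer": "A17", "item": "diapers", "price": 12.5, "cashier_note": "n/a"},
    {"customer": "A17", "item": "diapers", "price": 12.5, "cashier_note": "dup"},
    {"customer": "B02", "item": "beer", "price": None, "cashier_note": ""},
    {"customer": "C33", "item": "  ", "price": 8.0},
]
rows, missing = cleanse(raw, ["customer", "item", "price"])
```

Real cleansing pipelines do far more (type checks, outlier handling, reconciliation across sources), so treat this as an outline of the idea rather than a recipe.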
As the quote from Michael Lewis suggests, the point of data mining is knowledge discovery—the extraction of nonobvious or surprising information hidden in a data set. In data-mining circles, it's axiomatic that the less obvious the knowledge extracted, the more valuable that knowledge is to the organization. Nonobvious patterns represent new opportunities, be it for research, productivity, marketing, or whatever. This is best illustrated by the legendary diapers and beer connection, where data miners allegedly noticed that retail sales of diapers and beer would often spike in tandem. Why? Because new dads asked to pick up diapers on the way home from work would also pick up beer. When retailers stocked the products next to one another, sales of both were said to skyrocket.
Another term for finding previously unseen connections in a data set, especially when there are more than two variables, is pattern mining, and the quarried patterns are called association rules.
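An association rule is usually scored by its support (how often the items appear together) and its confidence (how often the second item appears when the first does). The sketch below is a deliberately simplified illustration restricted to pairs of items; full algorithms such as Apriori extend the idea to larger item sets. The function name `pair_rules`, the thresholds, and the shopping baskets are hypothetical.

```python
from itertools import combinations


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) tuples for item pairs."""
    n = len(baskets)
    item_count = {}
    pair_count = {}
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeats within one basket
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical baskets echoing the diapers-and-beer story.
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["diapers", "bread"],
    ["beer", "chips"],
    ["milk", "bread"],
]
rules = pair_rules(baskets)
```

With these made-up baskets, only the diapers and beer pair clears both thresholds, and it does so in both directions.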
Many data sets consist of large amounts of text, such as e-mail, so data-mining projects typically use textual analysis to dredge up connections within that data, a process known as text mining. Another promising avenue is audio mining (also called audio indexing), which is the process of extracting and indexing the words in an audio file and then using that index as data to be otherwise mined. It will come as no surprise that engineers have also come up with ingenious methods for indexing other types of media, including image mining and video mining. If the data set consists of geographical information, the process is called spatial (or geospatial) mining. In this increasingly social world, researchers are turning to crowd mining, where they try to unearth useful knowledge from large databases of social information. On a more general level, Web mining refers to the harvesting of useful patterns from data sets of Web content, Web usage (such as server logs), and Web structure (such as hyperlinks).
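The indexing idea behind text and audio mining can be sketched as an inverted index, a map from each word to the documents that contain it. This is a minimal illustration, not how any particular product works; the function names `build_index` and `co_occurring` and the sample messages are hypothetical, and for audio the text would first have to come from a speech-to-text step that is not shown.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def co_occurring(index, word_a, word_b):
    """Return the sorted ids of documents that mention both words."""
    return sorted(index.get(word_a, set()) & index.get(word_b, set()))


# Hypothetical e-mail bodies keyed by message id.
emails = {
    "msg1": "Please pick up diapers on the way home.",
    "msg2": "Beer and diapers are on sale this week.",
    "msg3": "The beer order shipped today.",
}
index = build_index(emails)
both = co_occurring(index, "beer", "diapers")
```

Once the index exists, it becomes a data set in its own right, which is what lets it be "otherwise mined" for connections between words.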
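If a data set is just too large to probe efficiently, data miners can often get away with sampling portions of it. Sampling should not be confused with data dredging, data fishing, or data snooping, three usually pejorative names for trawling a data set for any pattern at all without a hypothesis to test, a practice that tends to turn up spurious correlations.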
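One well-known way to sample a data set too large to hold in memory is reservoir sampling, which keeps a fixed-size random sample while reading the records exactly once. The sketch below is one possible illustration of sampling, not a method the column prescribes; the function name `reservoir_sample` and the stand-in data are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # a seed makes the sample repeatable
    sample = []
    for index, record in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Pick a slot in 0..index (inclusive); replace only if it is < k.
            j = rng.randint(0, index)
            if j < k:
                sample[j] = record
    return sample


# A range stands in for a data set too large to load at once.
sample = reservoir_sample(range(1_000_000), k=5, seed=42)
```

Because each record is examined once and only `k` records are kept, memory use stays constant no matter how large the data set grows.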
Data mining sounds innocent enough on the surface, but privacy advocates warn that it can be used for nonbenign purposes. When Internet service providers and companies such as Google hoard massive data sets that detail the online activities of hundreds of millions of people, automated data-mining methods can analyze that data to look for patterns of suspicious activity. As computer scientist Jonathan Zittrain has pointed out, "When governments begin to suspect people because of where they were at a certain time, it can get very worrying."
Whether it's a boon or a bane, informative or intrusive, you've seen here that the field of data mining is a rich source of new words and phrases. As I see it, my job here at IEEE Spectrum is to sift through the raw material of articles, papers, blogs, and books to uncover new lexical gems and then present them to you in this column. Call it word mining.
This article originally appeared in print as "The Data Gold Rush."