IT Has 26 Words for Data Mining

Intelligence about baseball had become equated in the public mind with the ability to recite arcane baseball stats. What [baseball statistician Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible. — Michael Lewis in Moneyball (2003)

Organizations of all sizes are sitting on mountains of data; what they really need are knowledge engineers who can excavate nuggets of valuable information from that data. Earlier this year (in "The Coming Data Deluge," IEEE Spectrum, February 2011), I mentioned the concept of data mining, which uses sophisticated software and database tools to extract nonobvious patterns, correlations, and useful information from large and complex data sets.

Data mining begins with data preprocessing: the gathering of the raw data, which is stored in a data warehouse or data mart. It continues with data cleansing, which removes unrelated or unnecessary data (called dirty data or noise) and looks for missing information.
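To make the cleansing step concrete, here is a minimal sketch in Python. It assumes that "unnecessary data" means exact duplicate records and that "missing information" means empty fields. The records and field names are hypothetical, invented for illustration; real cleansing pipelines do far more than this.

```python
# Hypothetical raw records; the field names are made up for illustration.
raw_records = [
    {"customer": "A17", "item": "diapers", "amount": "12.99"},
    {"customer": "A17", "item": "diapers", "amount": "12.99"},  # exact duplicate
    {"customer": "B02", "item": "beer", "amount": ""},          # missing amount
    {"customer": "", "item": "beer", "amount": "8.50"},         # missing customer
    {"customer": "C33", "item": "beer", "amount": "8.50"},
]

REQUIRED_FIELDS = ("customer", "item", "amount")


def cleanse(records):
    """Drop exact duplicates and set aside records with missing fields."""
    seen = set()
    clean, incomplete = [], []
    for record in records:
        key = tuple(record.get(field, "") for field in REQUIRED_FIELDS)
        if key in seen:
            continue  # assumption: an exact duplicate counts as noise
        seen.add(key)
        if all(value.strip() for value in key):
            clean.append(record)
        else:
            incomplete.append(record)  # flagged for follow-up, not discarded
    return clean, incomplete


clean, incomplete = cleanse(raw_records)
print(len(clean), "clean records;", len(incomplete), "with missing information")
```

Incomplete records are kept in a separate list rather than thrown away, since the cleansing step described above looks for missing information rather than simply deleting it.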

As the quote from Michael Lewis suggests, the point of data mining is knowledge discovery—the extraction of nonobvious or surprising information hidden in a data set. In data-mining circles, it's axiomatic that the less obvious the knowledge extracted, the more valuable that knowledge is to the organization. Nonobvious patterns represent new opportunities, be it for research, productivity, marketing, or whatever. This is best illustrated by the legendary diapers and beer connection, where data miners allegedly noticed that retail sales of diapers and beer would often spike in tandem. Why? Because new dads asked to pick up diapers on the way home from work would also pick up beer. When retailers stocked the products next to one another, sales of both were said to skyrocket.

Another term for finding previously unseen connections in a data set, especially when there are more than two variables, is pattern mining, and the quarried patterns are called association rules.
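An association rule such as "diapers imply beer" is usually scored with three standard measures: support (how often the items appear together), confidence (how often the rule holds when its first item is present), and lift (how much more often than chance). The sketch below computes them in Python over a handful of hypothetical shopping baskets. The data are invented and are not taken from any real retailer.

```python
# Hypothetical market baskets, one set of items per checkout.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    ante = support(antecedent, baskets)
    cons = support(consequent, baskets)
    if ante == 0 or cons == 0:
        raise ValueError("both sides of the rule must occur in at least one basket")
    both = support(set(antecedent) | set(consequent), baskets)
    confidence = both / ante   # share of antecedent baskets that also hold the consequent
    lift = confidence / cons   # above 1 means the items co-occur more often than chance
    return both, confidence, lift


rule_support, confidence, lift = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift only slightly above 1, as in this toy data, would be a weak rule. The legend of diapers and beer implies something considerably stronger.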

Many data sets consist of large amounts of text, such as e-mail, so data-mining projects typically use textual analysis to dredge up connections within that data, a process known as text mining. Another promising avenue is audio mining (also called audio indexing), which extracts and indexes the words in an audio file and then treats that index as data to be mined in its own right. It will come as no surprise that engineers have also come up with ingenious methods for indexing other types of media, giving us image mining and video mining. If the data set consists of geographical information, the process is called spatial (or geospatial) mining. In this increasingly social world, researchers are turning to crowd mining, where they try to unearth useful knowledge from large databases of social information. On a more general level, Web mining refers to the harvesting of useful patterns from data sets of Web content, Web usage (such as server logs), and Web structure (such as hyperlinks).
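The indexing idea behind text and audio mining can be sketched with an inverted index, a structure that maps each word to the documents containing it. The Python sketch below assumes the audio has already been turned into text by some speech-recognition step, which is not shown. The transcripts and their identifiers are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical transcripts, standing in for the output of speech recognition.
transcripts = {
    "call_001": "please pick up diapers on the way home",
    "call_002": "we are out of beer and diapers",
    "call_003": "the game starts at eight bring beer",
}


def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that contain every one of the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


index = build_index(transcripts)
print(sorted(search(index, "diapers")))
print(sorted(search(index, "beer", "diapers")))
```

Once the index exists, it is ordinary data, so the same pattern-hunting techniques used on any other data set can be pointed at it.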

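If a data set is just too large to probe efficiently, data miners can often get away with sampling portions of it. A less reputable shortcut is to trawl a data set for any pattern at all without a hypothesis in mind, which tends to turn up correlations that are merely coincidental; this practice is variously known as data dredging, data fishing, or data snooping.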
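One well-known way to sample a data set too large to hold in memory is reservoir sampling, which keeps a fixed-size random sample while reading the data once. The column does not name a particular method, so the choice of technique here is an assumption, as are the function name and parameters in this Python sketch.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # a seed makes the sample repeatable
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace a slot with probability k / (i + 1)
    return reservoir


# Stand-in for a large data set: the integers 0 through 99,999.
sample = reservoir_sample(range(100_000), k=5, seed=42)
print(sample)
```

Because the stream is read only once and only k items are ever stored, the same function works whether the data set holds a thousand records or a billion.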

Data mining sounds innocent enough on the surface, but privacy advocates warn that it can be used for nonbenign purposes. When Internet service providers and companies such as Google hoard massive data sets that detail the online activities of hundreds of millions of people, automated data-mining methods can analyze that data to look for patterns of suspicious activity. As computer scientist Jonathan Zittrain has pointed out, "When governments begin to suspect people because of where they were at a certain time, it can get very worrying."

Whether it's a boon or a bane, informative or intrusive, you've seen here that the field of data mining is a rich source of new words and phrases. As I see it, my job here at IEEE Spectrum is to sift through the raw material of articles, papers, blogs, and books to uncover new lexical gems and then present them to you in this column. Call it word mining.

This article originally appeared in print as "The Data Gold Rush."
