The Future of Big Data: Distilling Less Knowledge Per Bit

Without higher-value analyses, big data will overwhelm us


Until recently, the word data didn’t require a modifier. But we passed a watershed moment when we started referring to big data. Apparently, that wasn’t a sufficient description for some chunks of data, because people grasped for bolder terms, such as humongous data. Sadly, now, it appears that we have run out of appropriate adjectives. And yet data keeps getting bigger and bigger.

So instead of mentioning data, people have begun waving their hands and talking vaguely about the “cloud.” This seems to be the perfect metaphor—a mystical vapor hanging over Earth, occasionally raining information on the parched recipients below. It is both unknowable and all-knowing. It answers all questions, if only we know how to interpret those answers.

This evolution brings to mind two images. The first is from the current scientific hypothesis that all of the information in a black hole resides in the event horizon that surrounds it. This is like the idea of the cloud, while on Earth below, the practical reality of the cloud manifests in proliferating server farms. These farms bring the second image to mind: Douglas Adams’s city-size supercomputer, Deep Thought, from the classic novel (and radio play and TV show and movie) The Hitchhiker’s Guide to the Galaxy.

With these imaginary end states in mind, I wonder: Where is all this headed? Will data increase indefinitely, or is there some point of diminishing returns? Is there such a thing as enough data—or possibly too much?

There is a popular saying that “data is the new oil.” While I think this is an imperfect metaphor, it is true that both oil and data require refining to be useful. I’m mindful of the information pyramid evoked in T. S. Eliot’s poem “The Rock”: “Where is the wisdom we have lost in knowledge? / Where is the knowledge we have lost in information?”

For the purposes of our discussion, let’s say that data is composed of 1s and 0s, information is the words and images encoded by data, and knowledge is what we glean or learn from that information. The critical refining happens between information and knowledge. In the refining of oil, the ratio of useful final product to starting crude does not depend on how much crude you have. Not so with information: The more crude information we have to deal with, the less knowledge we want to produce per bit. Otherwise, big data will simply overwhelm us as it continues to grow. What we want is the small knowledge that we obtain from the big information. As the data set gets bigger, the job gets harder. The catch, however, is that unless the big information is big enough, it may not contain the small signal that we are searching for.
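The ratio argument above can be made concrete with a toy calculation. The growth rates here are invented purely for illustration (the article gives no figures): if stored data multiplies tenfold each period while distilled knowledge merely doubles, the knowledge obtained per bit must shrink relentlessly.

```python
# Toy sketch of the refining argument; both growth rates are
# hypothetical, chosen only to illustrate the shrinking ratio.
data_bits = 1_000_000   # hypothetical starting stock of raw data
knowledge_units = 100   # hypothetical stock of distilled findings

for period in range(4):
    per_bit = knowledge_units / data_bits
    print(f"period {period}: {per_bit:.2e} knowledge per bit")
    data_bits *= 10       # big data keeps getting bigger
    knowledge_units *= 2  # distilled knowledge grows far more slowly
```

With any choice of rates where data outpaces knowledge, the per-bit ratio falls every period, which is the column’s central point.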

Knowledge inevitably increases, so data has to increase even faster. Fortunately, storage technology seems capable of coping without turning Earth into a giant disk drive, but the crunch will be on the artificial intelligence and algorithms that turn data into knowledge. We have come a long way since Claude Shannon, in his classic 1948 paper on information theory, could simply ignore the knowledge problem by writing: “Frequently the messages have meaning.... These semantic aspects of communication are irrelevant to the engineering problem.”
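Shannon’s separation of information from meaning can be shown in a few lines of code. The function below is the standard textbook entropy computation (not anything drawn from Shannon’s paper itself): it measures the information in a message purely from symbol statistics, blind to whether the message means anything.

```python
from collections import Counter
from math import log2

def shannon_entropy(message: str) -> float:
    """Bits of information per symbol, computed from symbol frequencies alone."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Entropy sees only statistics, never semantics: a perfectly balanced
# two-symbol message carries one bit per symbol...
print(round(shannon_entropy("abababab"), 3))  # → 1.0
# ...while a highly skewed one carries far less, regardless of "meaning."
print(round(shannon_entropy("aaaaaaab"), 3))  # → 0.544
```

This is exactly the “engineering problem” Shannon solved; turning those bits into knowledge is the part his framework deliberately set aside.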

I’m also mindful of the propensity of drawers, closets, and hard drives to eventually become filled with useless junk. I sometimes blame this on the second law of thermodynamics, which states that entropy—that is, disorder—always increases. Perhaps this will ultimately be true of the cloud. Old, useless information accumulates, and it’s too much work to purge it. Moreover, who’s to say what is useless and what is not? Everything is in there, but everything is too much. Entropy is maximized, and the data ultimately becomes, as Shakespeare put it, full of sound and fury, signifying nothing.

This article appears in the May 2017 print issue as “The Counterintuitive Cloud.”
