Zipf Drive

Image: Laura H. Azran

I've long been fascinated with the omnipresence of power-law statistics in natural and social phenomena. A good example is Zipf's Law for the usage of English words, named for the 20th-century linguist George Kingsley Zipf. The most common word, the, is used twice as often as the second most popular word (of) and three times as often as the third (and). Similarly, the nth most popular word has a relative frequency of use of 1/n.

Thus, the curve of popularity versus rank shows a steep decline at first, followed by a long tail that looks rather flat when plotted on a linear scale. (On a log-log plot, of course, this becomes a straight line.) A word like omnipresence is way out on the tail, at popularity position 74 228, right before the word Borodin (the Russian composer), according to WordCount (http://wordcount.org).
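For readers who want to check the straight-line claim, here is a minimal sketch. It assumes an idealized law in which the word of rank n is used exactly 1/n as often as the top word; real word counts only approximate this. The function name and the sample ranks are illustrative choices, not anything taken from WordCount.

```python
import math


def zipf_relative_frequency(rank):
    """Frequency relative to the top-ranked word, assuming an ideal 1/n law."""
    return 1.0 / rank


# A handful of sample ranks (an arbitrary, illustrative selection).
ranks = [1, 2, 3, 10, 100, 1000]

# Each point is (log10 of rank, log10 of relative frequency).
points = [
    (math.log10(r), math.log10(zipf_relative_frequency(r))) for r in ranks
]

# Slope between each pair of neighboring points on log-log axes.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]

for r in ranks:
    print(f"rank {r:>5}: relative frequency {zipf_relative_frequency(r):.4f}")
print("log-log slopes:", [round(s, 6) for s in slopes])
```

Under this assumption every slope works out to the same value, which is what a straight line on log-log axes means. On a linear scale the same numbers drop off steeply over the first few ranks and then flatten into the tail.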

All of the most common words are short, resulting in a very efficient transmission of information. I imagine our distant ancestors sitting around the fire, drawing information-theory equations with sticks in the mud to come up with an optimally parsimonious language, after which they would decide that they shouldn't have used the word parsimonious (popularity number 49 309) when something like concise would have sufficed.

All this is to say that our vocabulary is rather a perfect blend--100 or so popular words used in everyday conversation and writing, together with about 100 000 more esoteric words that get sprinkled in for effect or special purpose.

Many other phenomena exhibit power-law (that is, polynomial) statistics--cities ranked by population, individuals by wealth, earthquakes by strength, Web sites by number of hits, books by online sales. I would even imagine that it applies to something like the distribution of knowledge in electrical engineering. All of us know Ohm's Law, for example, but perhaps only a tenth of us are familiar with the basic concepts in communications. Then maybe only one engineer in 1000 is familiar with a particular protocol, and only one in 100 000 might be conversant with a particular paper in a specific IEEE Transactions. But this is what makes the world go round; we have a lot of things in common, but there is a long tail of specialties that makes each individual unique.

Although power-law statistics have been long known, the subject has gotten much recent attention under the name “the long tail,” a phrase coined by Chris Anderson, the editor in chief of Wired magazine, in an article in 2004. Discussions have been prompted by the difference between sales in the physical world, where inventories are limited to the popular items, and those in the virtual world of the Internet, where there is no inventory constraint to eliminate all the rare items on the long tail. In the virtual world, the many small sales out on the long tail approximately equal the sales of the few most popular items.
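The balance between the head and the tail can be sketched with a little arithmetic. The example below assumes, purely for illustration, a catalog of 100 000 items whose sales follow an exact 1/n law; neither the catalog size nor the exact law comes from any real retailer's data. It finds how many top-ranked items it takes to account for half of all sales.

```python
def harmonic(n):
    """Sum of 1/k for k = 1..n, the total 'sales' of the top n items."""
    return sum(1.0 / k for k in range(1, n + 1))


def head_size_for_half(total_items):
    """Smallest number of top-ranked items whose combined share reaches 1/2."""
    half = harmonic(total_items) / 2.0
    running = 0.0
    for rank in range(1, total_items + 1):
        running += 1.0 / rank
        if running >= half:
            return rank
    return total_items


# Hypothetical catalog size, chosen only for illustration.
CATALOG_SIZE = 100_000

head = head_size_for_half(CATALOG_SIZE)
head_share = harmonic(head) / harmonic(CATALOG_SIZE)

print(f"top {head} items of {CATALOG_SIZE}: {head_share:.1%} of sales")
print(f"remaining {CATALOG_SIZE - head} items: {1 - head_share:.1%} of sales")
```

Under these assumptions, a head of only a few hundred items carries about as much as the tens of thousands of items behind it. A physical store that stocked only that head would forgo roughly half of the potential sales.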

In most cases there are fundamental reasons that statistics behave like a power law. For example, even though it might seem as if individual choices should be uniformly distributed among alternatives, an individual's choice is often influenced by the choices of others. This explains our herdlike behavior, with a flocking around popular choices and a long tail of individual dissent.

How could it be otherwise? Suppose for a moment that power-law statistics weren't the norm and that choices were uniformly distributed. What would the world be like? With all 100 000 or so words equally likely, books would be long and turgid but of little interest, because there would be so few subjects of common concern. And of course it would be almost impossible to learn a foreign language.

Population would be uniformly scattered about the Earth. There would be no cities, and whole countries would be like New Jersey, where I have to describe my home's location by the nearest exit number on the Garden State Parkway. For better or for worse, wealth would be uniformly distributed, and perhaps neither cathedrals nor slums would be so prevalent.

I'm sure that you can provide your own suppositions, but perhaps we could all agree that we wouldn't want to inhabit such a world. Our ancient ancestors around the fire figured this out a long time ago.

About the Author

ROBERT W. LUCKY considers how power-law statistics apply to language in this month's Reflections column. Lucky, an IEEE Fellow, now retired, was vice president for applied research at Telcordia Technologies in Red Bank, N.J.
