As a result of this process, the more established
site will have acquired a new link and increased its
popularity (as measured by PageRank). The next time
someone searches for information on the minollo, it will
be more likely that the established site will be ranked
even higher—fourth, say. The visibility of the less
established site, on the other hand, will not increase.
Now multiply this process by the billions of pages and
links in the Web graph, and by the millions of queries
handled by search engines every day. It is well known
that the popularity of pages (measured by their incoming
links) has a distribution with a very long tail, in
which a small fraction of very popular sites attract
most of the links. This is the so-called rich-get-richer
property of the Web. It seems reasonable to conclude
that search engines amplify such inequality among Web
sites by the vicious cycle described above: popular
sites become even more popular, while new pages are less
and less likely to ever be discovered.
This presumed phenomenon has been given many
names, including googlearchy (cf. Hindman, M. et al.,
2003, "Googlearchy: How a Few Heavily-Linked Sites
Dominate Politics on the Web"). We have found many academic
articles and blogs that discuss the phenomenon from
technical, social, and political perspectives. Some
scholars simply assume it, some present case studies
that seem to support the idea, and a few technical
papers attempt to prove and quantify the effect by
indirect measures, as well as to develop remedies.
In a recent paper (http://arxiv.org/cs.CY/0511005), our
group set out to quantify and model this effect using
empirical measurements that gauge it directly. We
followed the approach of Junghoo Cho at the University
of California at Los Angeles and his collaborators, and
extended their analysis to connect the indegree (number of
incoming links) of a site with the traffic to it. The
two are linked by a chain of scaling relationships
through PageRank, the rank of a search result page, and
the probability that the user will click on a hit. There
are three possible scenarios with clear scaling
signatures. First, the googlearchy effect would generate
a superlinear behavior of traffic as a function of
indegree. Second, if search engines were
popularity-neutral, i.e., if their ranking algorithms
did not amplify page popularity beyond that determined
by the Web graph structure, then the scaling would be
linear, with traffic being simply proportional to
popularity. Finally, a sublinear scaling would be the
signature of a popularity mitigation effect by search
engines.
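The three scenarios can be summarized by a single exponent. As a
minimal sketch, assume traffic t grows with indegree k as
t = c * k**alpha. The names alpha, c, and tol, and both functions
below, are illustrative assumptions and do not come from the paper.

```python
def traffic(indegree, alpha, c=1.0):
    """Hypothetical scaling law: traffic = c * indegree**alpha."""
    return c * indegree ** alpha


def classify_scaling(alpha, tol=0.05):
    """Map a scaling exponent to one of the three scenarios.

    tol is an arbitrary tolerance around 1 (an assumption), used so
    that small estimation errors are not read as an effect.
    """
    if alpha > 1 + tol:
        return "superlinear: googlearchy"
    if alpha < 1 - tol:
        return "sublinear: popularity mitigation"
    return "linear: popularity-neutral"
```

Under this assumption, a tenfold increase in indegree multiplies
traffic by 10**alpha: exactly tenfold in the neutral case, more
under googlearchy, and less under mitigation.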
To measure indegree we used the two major search
engines, Google and Yahoo, and to measure traffic we
used Alexa, the service that collects data from the
users of its toolbar. These measures of indegree and
traffic are the best available, but they are also
notoriously noisy, as some critics have observed
("Egalitarian Engines," The Economist, 19 Nov. 2005, p. 86).
To deal with the noise, we sampled a very large number
of random Web sites. We also employed logarithmic
statistical analysis techniques, which are very robust
to noise. This was possible because traffic and indegree
span many orders of magnitude, and we only need to study
their scaling relationship—not precise individual
values. We obtained remarkably consistent results using
either Google or Yahoo data (which yield very different
individual values) and by repeating our data collection
over several months (during which the two search engines
announced major upgrades of their collections).
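One simple way to study a scaling relationship in logarithmic space
is sketched below. The text does not say which technique was used,
so ordinary least squares on log-transformed values is our
assumption here, and the function name and arguments are
hypothetical. The slope of the fitted line estimates the scaling
exponent.

```python
import math


def fit_scaling_exponent(indegrees, traffics):
    """Least-squares fit of log10(traffic) against log10(indegree).

    Returns (slope, intercept). Pairs with a non-positive value are
    skipped because their logarithm is undefined.
    """
    points = [
        (math.log10(k), math.log10(t))
        for k, t in zip(indegrees, traffics)
        if k > 0 and t > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all indegree values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Because the fit works on logarithms, a site whose measured indegree
is off by a factor of two shifts its point by only about 0.3 along
an axis that spans many units, which is why a noisy sample can
still reveal the overall trend. A fitted slope only summarizes the
average trend and does not show that the data follow a power law.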
Figure 1 shows the results of our analysis. The
theoretical predictions
are power laws, which appear as straight lines on the
log-scale plot. The blue area represents the predictions
corresponding to a googlearchy effect (superlinear
scaling) while the line labeled "surfing model"
represents the case in which search engines are neutral,
as if users visited sites by surfing rather than
searching. The empirical data do not fit a power law,
but traffic evidently grows sublinearly with indegree.
Contrary to our expectation, this result
suggests that search engines actually have an
egalitarian effect, directing more traffic than expected
to less popular sites. Search engines thus appear to
counteract the skewed distribution of links in the Web,
directing some traffic toward sites that users would
never visit if they were just surfing rather than
searching. This egalitarian effect—which we could call
googlocracy—is at odds with the arguments above. What
gives?
To understand the observed googlocracy effect, we
need to reconsider our model of how users visit sites as
a result of their searches. The key factor that we have
neglected in the original model is what kind of content
users are interested in, and consequently what queries
they submit. To develop a "semantically correct" model
we must look at what people are searching for. We did
this by analyzing a large query log from AltaVista
containing almost 240 000 queries submitted by actual
users. We then looked at how many results are returned
by a search engine (we used Google) for these real
queries. We found that the statistical distribution of
hit set size, like that of indegree, is very skewed.
Rarely are queries so general that the search engine
returns a significant fraction of its collection; for
example, one in 1000 queries returns one tenth of the
collection or more. The majority of queries return fewer
than 30 000 hits (less than one page in a million from
the collection), and for 4 percent of the queries there
is only a screenful of results (10 or fewer hits).
When we take into account the semantic content of
queries and how it affects searches, we get a
semantically correct model that we can test by
simulation. As shown in Figure 1, the prediction is
remarkably accurate. The idea is that if a query returns
few hits, it is unlikely that globally popular pages
will be included, notwithstanding their huge
PageRank and indegree. On the other hand, new and less
established sites have a higher chance to be relevant to
specific queries, thus gaining visibility with respect
to topics that have not (yet) been assimilated by the
major sources. Going back to our earlier example, if few
people knew about the minollo, chances are that the
student would have found fewer than ten hits, thus
visiting both minollo-recipes.com and
save-the-minollo.org. Since such specific queries are
the majority, search engines have an egalitarian effect
as they direct traffic to sites that would never be
discovered by browsing alone.
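The intuition can be illustrated with a toy simulation. This is not
the model from the paper. It is a sketch under loud assumptions:
every query returns a fixed number of hits drawn uniformly at
random (relevance is independent of popularity), hits are ranked by
indegree, and the user clicks exactly one hit with probability
proportional to rank**(-click_exponent). The exponent value 1.6 and
all names below are illustrative choices.

```python
import random


def simulate_traffic(indegree, hit_set_size, n_queries,
                     click_exponent=1.6, seed=0):
    """Count clicks per page over n_queries simulated searches."""
    rng = random.Random(seed)
    pages = range(len(indegree))
    # Click weight for each rank position (rank 1 is the top hit).
    weights = [(r + 1) ** -click_exponent for r in range(hit_set_size)]
    traffic = [0] * len(indegree)
    for _ in range(n_queries):
        # Assumption: the hit set is a uniform random sample of pages.
        hits = rng.sample(pages, hit_set_size)
        # Rank the hits by popularity, most linked first.
        hits.sort(key=lambda p: indegree[p], reverse=True)
        clicked = rng.choices(hits, weights=weights, k=1)[0]
        traffic[clicked] += 1
    return traffic


def tail_share(traffic, indegree):
    """Fraction of clicks received by the less-linked half of pages."""
    order = sorted(range(len(indegree)), key=lambda p: indegree[p])
    tail = order[: len(order) // 2]
    return sum(traffic[p] for p in tail) / sum(traffic)
```

Running simulate_traffic once with a small hit_set_size (specific
queries) and once with hit_set_size equal to the number of pages
(maximally general queries), then comparing tail_share for the two
runs, shows the less-linked pages receiving a larger share of
clicks when hit sets are small.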
While search engines do not make for a level playing
field, their use partially mitigates the rich-get-richer
nature of the Web, giving new sites an increased chance
of being discovered, as long as they are about specific
topics that match the interests of users. So it seems
that cyberspace, for now, is more of a googlocracy than
a googlearchy.