As a result of this process, the more established site will have acquired a new link and increased its popularity (as measured by PageRank). The next time someone searches for information on the minollo, it will be more likely that the established site will be ranked even higher--fourth, say. The visibility of the less established site, on the other hand, will not increase. Now multiply this process by the billions of pages and links in the Web graph, and by the millions of queries handled by search engines every day. It is well known that the popularity of pages (measured by their incoming links) has a distribution with a very long tail, in which a small fraction of very popular sites attract most of the links. This is the so-called rich-get-richer property of the Web. It seems reasonable to conclude that search engines amplify such inequality among Web sites by the vicious cycle described above: popular sites become even more popular, while new pages are less and less likely to ever be discovered.
People have called this presumed phenomenon by many names, including googlearchy (cf., Hindman, M. et al., 2003. "Googlearchy: How a Few Heavily-Linked Sites Dominate Politics on the Web."). We have found many academic articles and blogs that discuss the phenomenon from technical, social, and political perspectives. Some scholars simply assume it, some present case studies that seem to support the idea, and a few technical papers attempt to prove and quantify the effect by indirect measures, as well as to develop remedies.
In a recent paper (cf., http://arxiv.org/cs.CY/0511005), our group set out to quantify and model this effect using empirical measurements that gauge it directly. We followed the approach of Junghoo Cho at the University of California, Los Angeles, and his collaborators, and extended their analysis to connect the indegree (number of incoming links) of a site with the traffic to it. The two are linked by a chain of scaling relationships through PageRank, the rank of a search result page, and the probability that the user will click on a hit. There are three possible scenarios, each with a clear scaling signature. First, the googlearchy effect would make traffic grow superlinearly as a function of indegree. Second, if search engines were popularity-neutral, i.e., if their ranking algorithms did not amplify page popularity beyond that determined by the Web graph structure, the scaling would be linear, with traffic simply proportional to popularity. Finally, sublinear scaling would be the signature of a popularity-mitigation effect by search engines.
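The three signatures can be told apart by fitting a power law, traffic proportional to indegree raised to an exponent, and checking whether the exponent is above, at, or below 1. Here is a minimal sketch of such a fit; the data are synthetic (generated with an assumed exponent of 0.8), since the actual measurements are not reproduced here.

```python
import math
import random

def scaling_exponent(indegree, traffic):
    """Least-squares slope of log(traffic) vs. log(indegree).

    Exponent > 1 (superlinear): a googlearchy-style amplification.
    Exponent = 1 (linear): a popularity-neutral search engine.
    Exponent < 1 (sublinear): a popularity-mitigation effect.
    """
    xs = [math.log(k) for k in indegree]
    ys = [math.log(t) for t in traffic]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic illustration: traffic ~ indegree^0.8 with multiplicative noise.
random.seed(42)
indegree = [random.randint(1, 10**6) for _ in range(5000)]
traffic = [k ** 0.8 * random.lognormvariate(0, 0.5) for k in indegree]

alpha = scaling_exponent(indegree, traffic)
print(f"estimated exponent: {alpha:.2f}")
```

Because the fit is done on logarithms, multiplicative noise in the measurements shifts individual points but barely moves the slope, which is why noisy indegree and traffic estimates can still yield a reliable scaling exponent.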
To measure indegree we used the two major search engines, Google and Yahoo, and to measure traffic we used Alexa, a service that collects data from the users of its toolbar. These measures of indegree and traffic are the best available, but they are also notoriously noisy, as some critics have observed (cf., "Egalitarian Engines." The Economist, 19 Nov. 2005, p. 86). To deal with the noise, we sampled a very large number of random Web sites and performed our statistical analysis on logarithmic scales, which makes it robust to noise. This was possible because traffic and indegree span many orders of magnitude, and we only needed to study their scaling relationship--not precise individual values. We obtained remarkably consistent results using either Google or Yahoo data (which yield very different individual values) and by repeating our data collection over several months (during which the two search engines announced major upgrades of their collections).
Figure 1 shows the results of our analysis. The theoretical predictions are power laws, which appear as straight lines on the log-log plot. The blue area represents the predictions corresponding to a googlearchy effect (superlinear scaling), while the line labeled "surfing model" represents the case in which search engines are neutral, as if users visited sites by surfing rather than searching. The empirical data do not fit a single power law, but it is evident that traffic grows sublinearly with indegree. Contrary to our expectation, this result suggests that search engines actually have an egalitarian effect, directing more traffic than expected to less popular sites. Search engines thus appear to counteract the skewed distribution of links in the Web, directing some traffic toward sites that users would never visit if they were just surfing rather than searching. This egalitarian effect--which we could call googlocracy--is at odds with the arguments above. What gives?
To understand the observed googlocracy effect, we need to reconsider our model of how users visit sites as a result of their searches. The key factor neglected in the original model is what kind of content users are interested in, and consequently what queries they submit. To develop a "semantically correct" model we must look at what people are searching for. We did this by analyzing a large query log from AltaVista containing almost 240,000 queries submitted by actual users. We then looked at how many results a search engine (we used Google) returns for these real queries. We found that, like indegree, hit set size has a very skewed statistical distribution. Rarely are queries so general that the search engine returns a significant fraction of its collection; only about one query in 1,000 returns one tenth of the collection or more. The majority of queries return fewer than 30,000 hits (less than one page in a million from the collection), and for 4 percent of the queries the results fit in a single screenful (10 or fewer hits).
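Summary statistics of this kind are straightforward to compute once per-query hit counts are available. The sketch below computes the three fractions discussed above (screenful-sized, small, and very broad hit sets); the hit counts here are a synthetic, log-uniform stand-in, not the AltaVista data, and the collection size is an assumed round number.

```python
import random

def hit_set_summary(hit_counts, collection_size):
    """Fractions of queries whose hit sets are tiny, small, or very broad."""
    n = len(hit_counts)
    frac_screenful = sum(h <= 10 for h in hit_counts) / n
    frac_small = sum(h < 30_000 for h in hit_counts) / n
    frac_broad = sum(h >= collection_size / 10 for h in hit_counts) / n
    return frac_screenful, frac_small, frac_broad

# Synthetic stand-in: hit-set sizes spread log-uniformly over nine
# orders of magnitude, up to an assumed collection of a billion pages.
random.seed(0)
collection_size = 10**9
counts = [int(10 ** random.uniform(0, 9)) for _ in range(100_000)]

tiny, small, broad = hit_set_summary(counts, collection_size)
print(f"screenful (<= 10 hits):        {tiny:.1%}")
print(f"small (< 30,000 hits):         {small:.1%}")
print(f"broad (>= 10% of collection):  {broad:.1%}")
```

The real distribution is far more skewed toward small hit sets than this log-uniform stand-in, which is precisely the point of the measurement.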
When we take into account the semantic content of queries and how it affects searches, we get a semantically correct model that we can test by simulation. As shown in Figure 1, its prediction is remarkably accurate. The idea is that if a query returns few hits, then it is unlikely that globally popular pages will be among them, notwithstanding their huge PageRank and indegree. On the other hand, new and less established sites have a higher chance of being relevant to specific queries, thus gaining visibility with respect to topics that have not (yet) been assimilated by the major sources. Going back to our earlier example, if few people knew about the minollo, chances are that the student would have found fewer than ten hits, and thus visited both minollo-recipes.com and save-the-minollo.org. Since such specific queries are the majority, search engines have an egalitarian effect, directing traffic to sites that would never be discovered by browsing alone.
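The mechanism can be illustrated with a toy simulation; this is an assumption-laden sketch, not the model from the paper: the Pareto exponents, collection size, and the one-click user behavior are all illustrative choices. Pages carry a heavy-tailed popularity score, each query matches a heavy-tailed number of random pages, and the user visits the best-ranked page among the hits. Because most hit sets are small, the most popular pages capture a much smaller share of the traffic than of the popularity.

```python
import random

random.seed(1)
N = 2000            # pages in the toy collection
QUERIES = 50_000    # simulated user queries

# Heavy-tailed popularity score (a stand-in for indegree/PageRank),
# sorted so that index 0 is the most popular page.
indegree = sorted((int(random.paretovariate(1.1)) for _ in range(N)),
                  reverse=True)
traffic = [0] * N

for _ in range(QUERIES):
    # Heavy-tailed hit-set size: most queries match only a few pages.
    size = min(int(random.paretovariate(1.0)), N)
    hits = random.sample(range(N), size)
    # The user visits the best-ranked (lowest-index) page in the hit set.
    traffic[min(hits)] += 1

top = N // 10   # the 10% most popular pages
indegree_share = sum(indegree[:top]) / sum(indegree)
traffic_share = sum(traffic[:top]) / sum(traffic)
print(f"popularity share of top 10% of pages: {indegree_share:.2f}")
print(f"traffic share of top 10% of pages:    {traffic_share:.2f}")
```

In this toy setting, a query with a single hit sends the visit to a uniformly random page regardless of its popularity; only broad queries let the popular pages exploit their high rank, which is why the aggregate traffic ends up less concentrated than the links.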
While search engines do not make for a level playing field, their use partially mitigates the rich-get-richer nature of the Web, giving new sites an increased chance of being discovered, as long as they are about specific topics that match the interests of users. So it seems that cyberspace, for now, is more of a googlocracy than a googlearchy.
About the Authors
Filippo Menczer is an associate professor of informatics, computer science, and cognitive science at Indiana University, Bloomington. His research interests focus on intelligent systems for Web mining. Santo Fortunato is a postdoctoral research scholar at the Indiana University School of Informatics. His current research focuses on technological networks and the social dynamics of opinion formation. Alessandro Flammini is an assistant professor in the School of Informatics at Indiana University. His interests are mainly in the study of complex networks and in the physics of biopolymers. Alessandro Vespignani is a professor of informatics, cognitive science, and physics at Indiana University. He works on the theory of complex systems and networks.