Search engines are our key to accessing information on the Web. Without search engines, we would easily become lost in cyberspace (as in the early days of the Web), so it is not surprising to see how heavily we rely on search engines as our information gateways. According to the Search Engine Roundtable blog, Jay McCarthy, vice president of Web server log analysis company WebSideStory, announced at the 2005 Search Engine Strategies Conference in Toronto that the number of page referrals from search engines has surpassed the number from other pages. This means that people navigate the Web by searching more than by browsing.
The question of search engine bias then becomes a crucial one. What if search engines showed only certain types of information, or preferred certain sources? Imagine, for example, submitting the query 'abortion' and finding only pro-life (or only pro-choice) sites in the first screen of hits. There are many kinds of potential bias--linguistic, political, cultural, commercial, and so on. The issue of bias resonates in the public debate on our growing dependence on search engines and on their social impact as gatekeepers of information. Is an information monopoly developing in the same way as the software monopoly of the recent past? Is Google the next Microsoft? If search engines are the lens through which we see the world, transparency is a major concern, and any bias gets in the way. Our worries are heightened because search engines are secretive about their algorithms and, thus, their biases are difficult to detect.
In the midst of this debate, one kind of bias that has received much attention among technologists, as well as social and political scientists, is the bias in favor of "popular" sites. This stems from the PageRank algorithm, introduced by the Google founders in 1998. All major search engines today use similar techniques to identify important or prestigious pages and bubble them to the top of the results. To a first approximation, PageRank attributes importance in proportion to the number of links that a page receives from other sites. The algorithm is a bit more sophisticated than that, but this approximation turns out to be pretty good on average (cf. http://arXiv.org/cs.IR/0511016).
The notion of prestige based on link popularity is a proxy for other possible importance measures, such as traffic, expert judgment, and so on. Most people would agree that the use of prestige measures in ranking search results is a very good thing--indeed, it's the main reason why search engines work so well and have become so popular. Moreover, PageRank is designed to mimic the browsing behavior of Web users. In the absence of better assumptions, we imagine that people follow links at random. PageRank then estimates the traffic through each site. It seems, therefore, to be just the right criterion to rank sites. Why worry then?
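The random-browsing idea can be made concrete with a short sketch. The Python code below is a minimal illustration, not the algorithm as any search engine actually runs it: it assumes the standard textbook formulation of PageRank, in which a surfer follows a random link with probability `damping` and otherwise jumps to a random page. The damping value of 0.85, the function name `pagerank`, and the toy pages `hub`, `a`, `b`, and `c` are assumptions made for this example and do not come from the discussion above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the share of random-surfer traffic that each page receives.

    links maps each page to the list of pages it links to.
    damping is the assumed probability of following a link rather than
    jumping to a random page (0.85 is the commonly cited value).
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)

    # Start with traffic spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Traffic leaving a page is split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links sends its traffic everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: three pages link to "hub", which links back to two of them.
toy_links = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most incoming links ends up with the largest estimated share of traffic, and the page that nobody links to ends up with the smallest, which is the behavior the link-count approximation predicts.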
To understand the potential danger of popularity bias, let us envision a scenario in which people search for information about the minollo (an imaginary animal). Imagine that there is an established site called minollo-recipes.com about the minollo and its culinary qualities. Further imagine a newly developed site called save-the-minollo.org that holds the view that the minollo is an endangered species and should no longer be hunted. Now, suppose a student is assigned the homework of creating a Web page with a report on the minollo. The student will submit the query 'minollo' to a search engine and, for lack of time, browse only the top ten hits. Let's say that minollo-recipes.com is the fifth hit, while save-the-minollo.org is ranked 15th. The student will read the established site and write her report on minollo recipes. She will not read about the possible endangered status of the minollo. She will also diligently cite her source by adding a link from her new page to minollo-recipes.com.































