Real-time search is trending. Late last year, Google and Microsoft rolled out much-anticipated real-time search engines, with Yahoo hard on their heels. A plethora of smaller start-ups offer services from Twitter searches to more comprehensive info grabs. But early results are in, and so far these engines aren’t skating straight 6s.
“So far, I’ve been pretty underwhelmed by the real-time search efforts,” says blogger and tech expert Robert Scoble. Search Engine Land editor Danny Sullivan also cautions in a blog post that while the idea of real-time search is exciting, “search engines are acting a bit like cats getting a sniff of catnip. They’re high on real-time search and acting kind of crazy.”
In a nutshell, real-time search means finding, indexing, and displaying in a search window any results that have just been posted to the information ether—be it a 140-character tweet, a status update on Facebook, a new blog entry, a photo, or even a news article. Search engines big and small are falling over themselves to roll out the technology that makes it happen.
Real-time content is “cool,” Sullivan says, because there is “some content that only exists in these microblog formats,” his label for tiny, instantaneous snippets of information. Driving home one night in December, for example, Sullivan found a tree that had fallen and was covering all the lanes in the road. “It just happened; no news outlet had covered it yet, so I put out a picture and the location [on Twitter],” he says. “More and more you see this real-time content out there. And people want to get at it.…If you don’t index these kind of stories, you miss them.”
But as more information gets added to the Web, search engines have to find better ways to sift through that content. According to Filippo Menczer, an Indiana University professor of informatics and computer science, the question is: How can we make sure that search engines can access information that was posted two minutes ago, even though their crawlers might visit a site only once a day or once a week? If someone tweets about an earthquake but you don’t see the post until a week later, Menczer says, “that’s not useful.”
Enter real-time search.
Sullivan argues that the term “real-time” should include only information that’s written and posted immediately. Basically, that means a tweet or a Facebook status update. Even a blog post takes time to write, he says, so by the time it’s published (even if it’s immediately available through instantaneous search algorithms) the ideas in it are old news, and therefore not real-time.
But to Stefan Weitz, director of Microsoft’s Bing search engine, real-time search is “a lot more than tweets or status updates.” There are dozens of ways to put out real-time information, Weitz points out. There’s photo tagging, YouTube, and MySpace, and “most of these forums have a temporal aspect,” he says. And sometimes information from two hours ago, or even last week, is real-time enough, he argues. It just comes down to the user.
For example, Bing’s recently released Bing Maps includes an application called Local Lens, which scans local community blogs and links to them, allowing a user to scroll over locations on a map to find out what’s happening around town. Those blogs are updated a few times a day, Weitz says, which wouldn’t be real-time according to Sullivan. Still, “that’s real-time-ish,” Weitz maintains. Sure, it’s “not as real-time as Twitter,” but it could still make a difference in deciding what to do on a Saturday night.
Key to this instantaneous search technology is having access to the data streams. “[Search engines] can’t crawl so fast, so they need agreements” with the key information sources, says Menczer, who is also associate director of IU’s Center for Complex Networks and Systems Research. Accordingly, Bing and Google got access to Twitter’s “firehose feed” (all the data streaming in from every tweet) last October. Bing got Facebook data at the same time, and Google got Facebook plus MySpace in December.
Once the search engines have all this data, of course, the trick is figuring out what to do with it. Even “assuming we have access to much of this content,” says Menczer, “it’s no longer in the tradition of Web text and hyperlinks.” So PageRank, Google’s tool for ranking Web sites (and, ultimately, the relevancy of results to a particular search), “may not be readily applied to social Web,” he says.
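To see why a link-based score struggles with brand-new posts, consider a minimal sketch of the textbook PageRank idea, computed by power iteration over a tiny made-up link graph. The graph, the page names, the damping factor of 0.85, and the function name are all illustrative assumptions; this is not Google’s production ranking system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    All names and parameter values here are hypothetical.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score, split evenly, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Three established pages that link to one another, plus a post so new
# that nothing links to it yet and it links to nothing.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "fresh_tweet": []}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the post with no inbound links ends up with the lowest score, however newsworthy it might be, which is the gap that other ranking signals would have to fill.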
Google’s search guru Amit Singhal acknowledges that corralling all this content into a results page isn’t easy and that PageRank depends on building authority over time. So while Singhal indicated at last December’s product release press conference that PageRank is still an important part of the engine’s ranking ability, other signals have to be added to make real-time search a powerful tool. Google engineers had to develop “at least a dozen new technologies” to pick out the most relevant pieces of information from the mass of data flooding the Web, including models of volume fluctuation (for instance, when a lot of tweets suddenly come in on a specific topic) and new language models (is it genuine information, or is it a “weather buoy” tweeting automatically?).
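Google has not published how its volume-fluctuation models work, but the basic idea of spotting a sudden surge of posts on a topic can be sketched simply. The example below flags any minute in which the number of posts jumps well above the recent average; the function name, the five-minute window, the threshold, and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_bursts(counts, window=5, threshold=3.0):
    """Return the indexes where a count spikes above its recent baseline.

    counts is a list of posts per time slot for one topic. A slot is a
    burst if it exceeds the trailing mean by `threshold` standard
    deviations. Hypothetical sketch, not any search engine's real model.
    """
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        baseline = mean(history)
        # Floor the spread at 1.0 so a perfectly flat history does not
        # turn every tiny wobble into a "burst".
        spread = max(pstdev(history), 1.0)
        if counts[i] > baseline + threshold * spread:
            bursts.append(i)
    return bursts


# Made-up posts per minute mentioning one topic; the jump at index 7
# stands in for something like an earthquake being reported.
posts_per_minute = [4, 5, 3, 4, 6, 5, 4, 48, 52, 7]
print(find_bursts(posts_per_minute))
```

Because the baseline is computed from the trailing window, only the onset of the surge is flagged here; once the spike is part of recent history, the following high minute no longer stands out. A real system would presumably combine a signal like this with others, such as the language models mentioned above.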