The Netflix researchers were nevertheless curious about the many other techniques for making similar predictions that had been published in the scholarly literature. The problem was that those studies had relied on public data sets containing on the order of a few million ratings, and it would take the small Netflix team a long time to explore how well these alternative techniques worked at a scale a thousand times as large. That is, if they did all the work themselves.

Reed Hastings, the chief executive of Netflix, suggested running a contest. He observed that Netflix had the means, the motive, and the opportunity. And the company had already staged a contest internally between the standard Cinematch system and an alternative algorithm, which did slightly, tantalizingly better.

The Netflix team came up with the basic structure of the contest. They would provide 100 million ratings that 480 000 anonymous customers had given to 17 000 movies. The data set would just fit in the main memory of a typical laptop, allowing almost anyone to compete. Netflix would withhold 3 million of the most recent ratings and ask the contestants to predict them. A Netflix computer would assess each contestant’s 3 million predictions by comparing predictions with actual ratings. The system would use the traditional metric for predictive accuracy, the root-mean squared error (RMSE). The more accurate a set of predictions, the smaller the RMSE will be. The score would then be reported back immediately to the contestant and reflected on an online leaderboard for all to see.

Each such scoring provides valuable information about the hidden ratings--so valuable, in fact, that under certain circumstances it could be used to game the system. The teams were therefore scored once a day, at most. But to help teams estimate how well they might do, Netflix also provided them each with a small representative data set and the score Cinematch had been able to attain for it. Contestants could use that set to test their systems as often as they wanted.

On 2 October 2006, Netflix launched the competition, and within days thousands of teams from hundreds of countries signed up. Within weeks the Netflix Prize Web site was getting hundreds of submissions per day. An online forum was created so that participants could share ideas and techniques, even code. Even more gratifying to Netflix, within months a handful of teams did several percent better than Cinematch. The question then was how much the accuracy would improve in the first year.

Like most of the other top competitors, the three of us at AT&T Labs consulted the rich body of research on ways of solving problems in this domain, known as collaborative filtering.

One of the main areas of collaborative filtering we exploited is the nearest-neighbor approach. A movie’s ”neighbors” in this context are other movies that tend to be scored most similarly when rated by the same viewer [see illustration, ”The Neighborhood Model”]. For example, consider Saving Private Ryan (1998), a war movie directed by Steven Spielberg and starring Tom Hanks. Its neighbors may include other war movies, movies directed by Spielberg, or movies starring Tom Hanks. To predict a particular viewer’s rating, we would look for the nearest neighbors to Saving Private Ryan that the viewer had already seen and rated. For some viewers, it may be easy to find a full allotment of close neighbors; for many others, we may discover only a handful of neighboring movies. Our version of the nearest-neighbor approach predicted ratings using a weighted average of the viewer’s previous ratings on up to 50 neighboring movies. (We have since developed a way to use all past ratings, allowing an unlimited number of neighbors.)

A second area of collaborative-filtering research we pursued involves what are known as latent-factor models. These score both a given movie and a given viewer according to a set of factors, themselves inferred from patterns in the ratings given to all the movies by all the viewers [see illustration, ”The Latent-Factor Approach”]. Factors for movies may measure comedy versus drama, action versus romance, and orientation to children versus orientation to adults. Because the factors are determined automatically by algorithms, they may correspond to hard-to-describe concepts such as quirkiness, or they may not be interpretable by humans at all. Factors for viewers measure how much the viewer likes movies that score highly on the corresponding movie factor. Thus, a movie may be classified a comedy, and a viewer may be classified as a comedy lover.

The model may use 20 to 40 such factors to locate each movie and viewer in a multidimensional space. It then predicts a viewer’s rating of a movie according to the movie’s score on the dimensions that person cares about most. We can put these judgments in quantitative terms by taking the dot (or scalar) product of the locations of the viewer and the movie.