Each (user, movie) pair in the published dataset included a rating (1–5 stars) and the date on which the rating was made.
The root-mean-square error (RMSE) was used to measure the performance of algorithms. CineMatch has an RMSE of approximately 0.95; that is, its typical prediction is off by almost one full star. To win the prize, it was necessary that your algorithm have an RMSE that was at most 90% of the RMSE of CineMatch.
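For reference, the RMSE over a test set T of (user, movie) pairs can be written as follows (the symbols T, p, and r are our notation, not the challenge's):

    \mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} \bigl(p_{um} - r_{um}\bigr)^2}

where p_{um} is the predicted rating and r_{um} the actual rating of movie m by user u. Under this measure, beating CineMatch by 10% meant reaching an RMSE of roughly 0.855 or lower.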
The bibliographic notes for this chapter include references to descriptions of the winning algorithms. Here, we mention some interesting and perhaps unintuitive facts about the challenge.
• CineMatch was not a very good algorithm. In fact, it was discovered early that the obvious algorithm of predicting, for the rating by user u on movie m, the average of:
(1) the average rating given by u over all the movies u has rated, and
(2) the average of the ratings for movie m by all users who rated that movie,
was only 3% worse than CineMatch. (A sketch of this baseline appears after this list.)
• The UV-decomposition algorithm described in Section 9.4 was found by three students (Michael Harris, Jeffrey Wang, and David Kamm) to give a 7% improvement over CineMatch, when coupled with normalization and a few other tricks.
• The winning entry was actually a combination of several different algorithms that had been developed independently. A second team, which submitted an entry that would have won had it been submitted a few minutes earlier, was also a blend of independent algorithms. This strategy of combining different algorithms has been used before on a number of hard problems and is something worth remembering. (A simple blending sketch also follows this list.)
• Several attempts have been made to use the data contained in IMDB, the Internet Movie Database, to match the names of movies from the NetFlix challenge with their names in IMDB, and thus extract useful information not contained in the NetFlix data itself. IMDB has information about actors and directors, and classifies movies into one or more of 28 genres. It was found that genre and other information was not useful. One possible reason is that the machine-learning algorithms were able to discover the relevant information anyway, and a second is that the entity-resolution problem of matching movie names as given in the NetFlix and IMDB data is not that easy to solve exactly.
• Time of rating turned out to be useful. It appears there are movies that are more likely to be appreciated by people who rate them immediately after viewing than by those who wait a while and then rate them. “Patch Adams” was given as an example of such a movie. Conversely, there are other movies that were not liked by those who rated them immediately, but were better appreciated after a while; “Memento” was cited as an example.
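To make the simple baseline from the first point concrete, here is a minimal sketch in Python. It is our own illustration, not code from the challenge; the function name and the toy data are hypothetical, and we assume the user and the movie each have at least one known rating.

    def baseline_predict(ratings, user, movie):
        """Predict the rating of `movie` by `user` as the average of
        (1) the user's mean rating and (2) the movie's mean rating.
        `ratings` maps (user, movie) pairs to known ratings."""
        user_ratings = [r for (u, m), r in ratings.items() if u == user]
        movie_ratings = [r for (u, m), r in ratings.items() if m == movie]
        user_avg = sum(user_ratings) / len(user_ratings)
        movie_avg = sum(movie_ratings) / len(movie_ratings)
        return (user_avg + movie_avg) / 2

    # Toy example: u2 has an average rating of 5.0, and "Patch Adams"
    # has an average rating of 2.0, so the prediction is 3.5.
    ratings = {("u1", "Memento"): 4, ("u1", "Patch Adams"): 2,
               ("u2", "Memento"): 5}
    print(baseline_predict(ratings, "u2", "Patch Adams"))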
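As for combining independently developed algorithms, one simple way to blend two predictors is a linear combination whose weights are fit by least squares on a held-out probe set. This is only an illustration of the general idea, not the winners' actual recipe; the arrays below are made-up numbers.

    import numpy as np

    def fit_blend_weights(pred_a, pred_b, truth):
        # Least-squares weights for combining two prediction vectors,
        # fit on a held-out probe set with known ratings.
        X = np.column_stack([pred_a, pred_b])
        w, *_ = np.linalg.lstsq(X, truth, rcond=None)
        return w

    # Fit the weights on probe data, then apply them to new predictions.
    probe_a = np.array([3.1, 4.2, 2.0])
    probe_b = np.array([3.5, 3.9, 2.4])
    probe_truth = np.array([3.0, 4.0, 2.5])
    w = fit_blend_weights(probe_a, probe_b, probe_truth)
    blended = w[0] * probe_a + w[1] * probe_b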