Each (user, movie) pair in the published dataset included a rating (1–5 stars) and the date on which the rating was made.
The root-mean-square error (RMSE) was used to measure the performance of algorithms. CineMatch has an RMSE of approximately 0.95; that is, its typical prediction is off by almost one full star. To win the prize, it was necessary that your algorithm have an RMSE that was at most 90% of the RMSE of CineMatch.
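For reference, the RMSE over a test set T of (user, movie) pairs can be written as follows (the symbols T, p, and r are our notation, not the challenge's):

    \mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} \bigl(p_{um} - r_{um}\bigr)^2}

where p_{um} is the predicted rating and r_{um} the actual rating of movie m by user u. Under this measure, beating CineMatch by 10% meant reaching an RMSE of roughly 0.855 or lower.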
The bibliographic notes for this chapter include references to descriptions of the winning algorithms. Here, we mention some interesting and perhaps unintuitive facts about the challenge.
• CineMatch was not a very good algorithm. In fact, it was discovered early that the obvious algorithm of predicting, for the rating by user u on movie m, the average of:
(1) the average rating given by u over all the movies u has rated, and
(2) the average of the ratings for movie m by all users who rated that movie,
was only 3% worse than CineMatch. (A sketch of this baseline appears after this list.)
• The UV-decomposition algorithm described in Section 9.4 was found by three students (Michael Harris, Jeffrey Wang, and David Kamm) to give a 7% improvement over CineMatch, when coupled with normalization and a few other tricks.
• The winning entry was actually a combination of several different algorithms that had been developed independently. A second team, which submitted an entry that would have won had it been submitted a few minutes earlier, was also a blend of independent algorithms. This strategy of combining different algorithms has been used before on a number of hard problems and is something worth remembering. (A simple blending sketch also follows this list.)
• Several attempts have been made to use the data contained in IMDB, the Internet Movie Database, to match the names of movies from the NetFlix challenge with their names in IMDB, and thus extract useful information not contained in the NetFlix data itself. IMDB has information about actors and directors, and classifies movies into one or more of 28 genres. It was found that genre and other information was not useful. One possible reason is that the machine-learning algorithms were able to discover the relevant information anyway, and a second is that the entity-resolution problem of matching movie names as given in the NetFlix and IMDB data is not that easy to solve exactly.
• Time of rating turned out to be useful. It appears there are movies that are more likely to be appreciated by people who rate them immediately after viewing than by those who wait a while and then rate them. “Patch Adams” was given as an example of such a movie. Conversely, there are other movies that were not liked by those who rated them immediately, but were better appreciated after a while; “Memento” was cited as an example.
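To make the simple baseline from the first point concrete, here is a minimal sketch in Python. It is our own illustration, not code from the challenge; the function name and the toy data are hypothetical, and we assume the user and the movie each have at least one known rating.

    def baseline_predict(ratings, user, movie):
        """Predict the rating of `movie` by `user` as the average of
        (1) the user's mean rating and (2) the movie's mean rating.
        `ratings` maps (user, movie) pairs to known ratings."""
        user_ratings = [r for (u, m), r in ratings.items() if u == user]
        movie_ratings = [r for (u, m), r in ratings.items() if m == movie]
        user_avg = sum(user_ratings) / len(user_ratings)
        movie_avg = sum(movie_ratings) / len(movie_ratings)
        return (user_avg + movie_avg) / 2

    # Toy example: u2 has an average rating of 5.0, and "Patch Adams"
    # has an average rating of 2.0, so the prediction is 3.5.
    ratings = {("u1", "Memento"): 4, ("u1", "Patch Adams"): 2,
               ("u2", "Memento"): 5}
    print(baseline_predict(ratings, "u2", "Patch Adams"))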
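As for combining independently developed algorithms, one simple way to blend two predictors is a linear combination whose weights are fit by least squares on a held-out probe set. This is only an illustration of the general idea, not the winners' actual recipe; the arrays below are made-up numbers.

    import numpy as np

    def fit_blend_weights(pred_a, pred_b, truth):
        # Least-squares weights for combining two prediction vectors,
        # fit on a held-out probe set with known ratings.
        X = np.column_stack([pred_a, pred_b])
        w, *_ = np.linalg.lstsq(X, truth, rcond=None)
        return w

    # Fit the weights on probe data, then apply them to new predictions.
    probe_a = np.array([3.1, 4.2, 2.0])
    probe_b = np.array([3.5, 3.9, 2.4])
    probe_truth = np.array([3.0, 4.0, 2.5])
    w = fit_blend_weights(probe_a, probe_b, probe_truth)
    blended = w[0] * probe_a + w[1] * probe_b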