Extracting features from the MovieLens dataset
For this example, we will return to the movie rating dataset we used in Chapter 4, Building
a Recommendation Engine with Spark. Recall that we have three main datasets: one that
contains the movie ratings (in the u.data file), a second one with user data (u.user),
and a third one with movie data (u.item). We will also use the genre data file (u.genre)
to extract the genres for each movie.
We will start by looking at the movie data:
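// Load the movie data; sc is the SparkContext provided by the Spark shell
val movies = sc.textFile("/PATH/ml-100k/u.item")
// Print the first record of the file
println(movies.first)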
This should output the first line of the dataset:
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
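To make the record layout concrete, the following is a minimal sketch of how one such line could be split into its fields using plain Scala. The Movie case class and the parseMovie and genreIndices helpers are hypothetical names introduced only for illustration, and the sketch assumes that the first five fields are the ID, title, release date, video release date, and URL, with every remaining field being a 0/1 genre flag.

```scala
// Illustrative sketch: Movie, parseMovie, and genreIndices are hypothetical
// helpers, not part of the original text or of any Spark API.
object MovieLineExample {
  // The ID, title, and 0/1 genre indicator flags of one u.item record
  case class Movie(id: Int, title: String, genreFlags: Seq[Int])

  def parseMovie(line: String): Movie = {
    // "|" is a regex metacharacter, so it must be escaped for split
    val fields = line.split("\\|")
    // Assumed layout: id, title, release date, video release date, URL,
    // followed by the genre flags
    Movie(fields(0).toInt, fields(1), fields.drop(5).map(_.toInt).toSeq)
  }

  // Positions of the genre flags that are set to 1
  def genreIndices(movie: Movie): Seq[Int] =
    movie.genreFlags.zipWithIndex.collect { case (1, idx) => idx }

  def main(args: Array[String]): Unit = {
    val line = "1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/" +
      "title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
    val movie = parseMovie(line)
    println(s"${movie.id}: ${movie.title}, genre indices: " +
      genreIndices(movie).mkString(", "))
  }
}
```

On an RDD, the same function could be applied with movies.map(MovieLineExample.parseMovie). The genre indices can then be matched against the entries of the genre data file.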
So, we have access to the movie title, and we already have the movies categorized into
genres. Why do we need to apply a clustering model to the movies? Clustering the movies
is a useful exercise for two reasons:
• First, because we have access to the true genre labels, we can use these to evaluate
the quality of the clusters that the model finds
• Second, we might wish to segment the movies based on some other attributes or
features, apart from their genres
At first glance, it seems that we have little data to use for clustering apart from the
genres and title. However, this is not the case: we also have the ratings data.
Previously, we created a matrix factorization model from the ratings data. The model is
made up of a set of user and movie factor vectors.
We can think of the movie factors as representing each movie in a new latent feature space,
where each latent feature, in turn, represents some form of structure in the ratings matrix.
While it is not possible to interpret the latent features directly, they might represent some
hidden structure that influences the rating behavior of users toward movies. One factor
could represent genre preference, another could refer to actors or directors, while yet
another could represent the theme of the movie, and so on.
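As a rough illustration of how factor vectors relate to ratings, a matrix factorization model typically scores a user-movie pair by taking the dot product of the two factor vectors, so each latent feature contributes one term to the predicted rating. The following sketch uses made-up rank-3 vectors. The dot helper and the numbers are assumptions for illustration only and are not taken from a trained model.

```scala
// Illustrative sketch: the dot helper and the factor values are made up.
object LatentFactorExample {
  // Dot product of two factor vectors of the same rank
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical factors: one value per latent feature
    val userFactors = Array(1.0, 0.5, -2.0)
    val movieFactors = Array(2.0, 4.0, 0.5)
    // 1.0 * 2.0 + 0.5 * 4.0 + (-2.0) * 0.5 = 3.0
    println(dot(userFactors, movieFactors))
  }
}
```

Movies whose factor vectors lie close together in this space receive similar scores from any given user, which is why the movie factors are a reasonable input for clustering.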