So, if we use these factor vector representations of each movie as inputs to our clustering
model, we will end up with a clustering that is based on the actual rating behavior of users
rather than manual genre assignments.
The same logic applies to the user factors—they represent users in the latent feature space
of rating behavior, so clustering the user vectors should result in a clustering based on user
rating behavior.
Extracting movie genre labels
Before proceeding further, let's extract the genre mappings from the u.genre file. As you
can see from the first line of the preceding dataset, we will need to map from the numerical
genre assignments to the textual version so that they are readable.
Take a look at the first few lines of u.genre:
val genres = sc.textFile("/PATH/ml-100k/u.genre")
genres.take(5).foreach(println)
You should see the following output displayed:
unknown|0
Action|1
Adventure|2
Animation|3
Children's|4
Here, 0 is the index of the relevant genre, while unknown is the genre assigned for this
index. The indices correspond to the indices of the binary subvector that will represent the
genres for each movie (that is, the 0s and 1s in the preceding movie data).
To extract the genre mappings, we will split each line and extract a key-value pair, where
the key is the genre index and the value is the text genre. Note that we have to filter out
an empty line at the end of the file; otherwise, an error is thrown when we try to access the
fields of the split line (see the filter call in the following code):
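// Drop the trailing empty line, then split each remaining line on '|'
val genreMap = genres.filter(!_.isEmpty).map { line =>
  val array = line.split("\\|")
  (array(1), array(0))  // key: genre index, value: genre name
}.collectAsMap
println(genreMap)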
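To make the role of the indices concrete, here is a minimal sketch of how such a map could be used to turn a movie's binary genre subvector into readable genre names. It uses plain Scala collections rather than an RDD, so it runs without a SparkContext. The names sampleLines, sampleGenreMap, genreFlags, and assigned are hypothetical, and the flag values are invented for illustration and truncated to the five genres shown above; they are not taken from the dataset.

```scala
// Plain Scala collections stand in for the RDD. The first five entries are
// the sample lines shown above; the empty string mimics the blank line at
// the end of the file.
val sampleLines = Seq("unknown|0", "Action|1", "Adventure|2", "Animation|3", "Children's|4", "")

// Same logic as the Spark version: drop empty lines, split on '|',
// and key each genre name by its index.
val sampleGenreMap: Map[String, String] = sampleLines.filter(_.nonEmpty).map { line =>
  val fields = line.split("\\|")
  (fields(1), fields(0))
}.toMap

// Hypothetical genre flags for one movie (illustrative values only).
val genreFlags = Seq(0, 1, 1, 0, 0)

// Keep the positions whose flag is 1 and look up the genre name for each.
val assigned = genreFlags.zipWithIndex.collect {
  case (flag, idx) if flag == 1 => sampleGenreMap(idx.toString)
}

println(assigned.mkString(", "))  // prints: Action, Adventure
```

The keys are kept as strings here because splitting a line yields strings; converting them with toInt would be an equally reasonable choice if the indices are used numerically later on.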