Building a Clustering Model with Spark - Machine Learning with Spark

So, if we use these factor vector representations of each movie as inputs to our clustering

model, we will end up with a clustering that is based on the actual rating behavior of users

rather than manual genre assignments.

The same logic applies to the user factors—they represent users in the latent feature space

of rating behavior, so clustering the user vectors should result in a clustering based on user

rating behavior.

Extracting movie genre labels

Before proceeding further, let's extract the genre mappings from the u.genre file. As

you can see from the first line of the preceding dataset, we will need to map from the

numerical genre assignments to the textual version so that they are readable.

Take a look at the first few lines of u.genre:

val genres = sc.textFile("/PATH/ml-100k/u.genre")

genres.take(5).foreach(println)

You should see the following output displayed:

unknown|0

Action|1

Adventure|2

Animation|3

Children's|4

Here, 0 is the index of the relevant genre, while unknown is the genre assigned for this

index. The indices correspond to the indices of the binary subvector that will represent the

genres for each movie (that is, the 0s and 1s in the preceding movie data).
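
As a minimal sketch of how these indices line up with the binary subvector, the following plain Scala snippet (no Spark needed) decodes a made-up flag vector using only the five genres shown above. The names genreNames and flags, and the flag values themselves, are illustrative assumptions and are not taken from the dataset.

```scala
// Genre names in index order, copied from the five lines of u.genre shown above.
val genreNames = Vector("unknown", "Action", "Adventure", "Animation", "Children's")

// Hypothetical binary genre flags for one movie (illustrative values only).
val flags = Vector(0, 1, 1, 0, 0)

// Keep the names whose flag is 1; position i in flags refers to genre index i.
val assigned = flags.zipWithIndex.collect { case (1, idx) => genreNames(idx) }

println(assigned.mkString(", ")) // Action, Adventure
```

Because the flag positions and the genre indices share the same ordering, a lookup by position is all that is needed to turn the 0s and 1s into readable labels.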

To extract the genre mappings, we will split each line and extract a key-value pair, where

the key is the genre index (kept as a string) and the value is the text genre. Note that we

have to filter out an empty line at the end; otherwise, splitting it yields no second field,

and indexing into the result throws an error:

val genreMap = genres.filter(!_.isEmpty).
  map(line => line.split("\\|")).
  map(array => (array(1), array(0))).
  collectAsMap

println(genreMap)
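
To try the same split-and-pair logic without a Spark context, here is a small sketch on a local collection. The sample lines are the ones printed earlier plus a trailing empty string standing in for the empty last line; sampleLines and localGenreMap are hypothetical names, and toMap stands in for collectAsMap on an RDD.

```scala
// Hypothetical local stand-in for the contents of u.genre, including a trailing empty line.
val sampleLines = Seq("unknown|0", "Action|1", "Adventure|2", "Animation|3", "Children's|4", "")

// Same steps as the Spark code: drop empty lines, split on '|', pair (index, name).
val localGenreMap = sampleLines.
  filter(_.nonEmpty).
  map(_.split("\\|")).
  map(fields => (fields(1), fields(0))).
  toMap

println(localGenreMap("1")) // Action
```

Note that the keys are strings ("1", not 1), because the index is taken directly from the split line without conversion.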
