Building a Clustering Model with Spark - Machine Learning with Spark


The preceding code will provide the following output:

Map(2 -> Adventure, 5 -> Comedy, 12 -> Musical, 15 -> Sci-Fi, 8 -> Drama, 18 -> Western, ...

Next, we'll create a new RDD from the movie data and our genre mapping; this RDD contains the movie ID, title, and genres. We will use this later to create a more readable output when we evaluate the clusters assigned to each movie by our clustering model.

In the following code section, we will map over each movie and extract the genre subvector, whose entries are still the Strings "0" and "1" rather than Ints. We will then apply the zipWithIndex method to pair each entry with its position in the subvector, and filter the resulting collection so that only the positive assignments remain (that is, the 1s that denote a genre assignment for the relevant index). We can then use our extracted genre mapping to map these indices to the textual genres. Finally, we will inspect the first record of the new RDD to see the result of these operations:

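val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
  // Fields from index 5 onward are the 0/1 genre flags
  val genres = array.toSeq.slice(5, array.size)
  // Keep only the flagged positions and look up their genre names
  val genresAssigned = genres.zipWithIndex.filter { case (g, idx) =>
    g == "1"
  }.map { case (g, idx) =>
    genreMap(idx.toString)
  }
  // (movie ID, (title, assigned genres))
  (array(0).toInt, (array(1), genresAssigned))
}
println(titlesAndGenres.first)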

This should output the following result:

(1,(Toy Story (1995),ArrayBuffer(Animation, Children's, Comedy)))
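To see the zipWithIndex, filter, and map steps in isolation, the following is a minimal sketch in plain Scala that needs no Spark context. It applies the same logic to a single hand-written record. The object name GenreLookupSketch, the parse helper, the sample line (with "url" standing in for the real URL field), and the partial genre map are all assumptions made for illustration; they are not part of the original code.

```scala
object GenreLookupSketch {
  // Assumed subset of the genre mapping (index as String -> genre name).
  // Only the entries needed for the sample record plus a few from the
  // output shown earlier are included here.
  val genreMap: Map[String, String] = Map(
    "2" -> "Adventure", "3" -> "Animation", "4" -> "Children's",
    "5" -> "Comedy", "8" -> "Drama", "12" -> "Musical",
    "15" -> "Sci-Fi", "18" -> "Western")

  // Illustrative pipe-delimited record: five metadata fields followed by
  // 19 genre flags. "url" is a placeholder, not the real field value.
  val line: String =
    "1|Toy Story (1995)|01-Jan-1995||url|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"

  // Hypothetical helper mirroring the body of the RDD map function.
  def parse(record: String): (Int, (String, Seq[String])) = {
    val array = record.split("\\|")
    // Genre flags start at index 5
    val genres = array.toSeq.slice(5, array.size)
    val genresAssigned = genres.zipWithIndex
      .filter { case (g, _) => g == "1" }            // keep flagged positions
      .map { case (_, idx) => genreMap(idx.toString) } // index -> genre name
    (array(0).toInt, (array(1), genresAssigned))
  }

  def main(args: Array[String]): Unit =
    println(parse(line))
}
```

Because the sketch uses a partial map, a record flagged with a genre index that is missing from genreMap would throw a NoSuchElementException; with the full mapping loaded from the data, as in the text, this does not arise.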

Training the recommendation model

To get the user and movie factor vectors, we first need to train another recommendation model. Fortunately, we have already done this in Chapter 4, Building a Recommendation Engine with Spark, so we will follow the same procedure:
