Tip
The preceding pow function is a Breeze universal function. This function is the same as
the pow function from scala.math, except that it operates element-wise on the vector
that is returned from the minus operation between the two input vectors.
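The element-wise behavior described in the tip can be illustrated without Breeze. The following is a minimal sketch in plain Scala, assuming the distance in question is the sum of squared differences (exponent 2); the object and method names are hypothetical and are not part of Breeze or of the preceding code.

```scala
// Plain-Scala sketch (no Breeze dependency) of what an element-wise pow does.
// All names here are illustrative assumptions, not Breeze APIs.
object ElementwisePowSketch {
  // Element-wise subtraction of two equal-length vectors.
  def minus(a: Array[Double], b: Array[Double]): Array[Double] = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => x - y }
  }

  // Apply scala.math.pow to every element, as a universal function would.
  def powElementwise(v: Array[Double], exponent: Double): Array[Double] =
    v.map(x => math.pow(x, exponent))

  // Sum of squared differences, that is, the squared Euclidean distance.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    powElementwise(minus(a, b), 2.0).sum
}
```

For example, ElementwisePowSketch.squaredDistance(Array(1.0, 2.0), Array(4.0, 6.0)) subtracts to (-3.0, -4.0), squares each element to (9.0, 16.0), and sums to 25.0.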
Now, we will use this function to compute, for each movie, the distance of the relevant
movie factor vector from the center vector of the assigned cluster. We will also join our
cluster assignments and distances data with the movie titles and genres so that we can
output the results in a more readable way:
// Attach each movie's factor vector to its title and genres, joined on movie ID
val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map { case (id, ((title, genres), vector)) =>
  val pred = movieClusterModel.predict(vector)
  val clusterCentre = movieClusterModel.clusterCenters(pred)
  // Distance from this movie's factor vector to its assigned cluster center
  val dist = computeDistance(DenseVector(clusterCentre.toArray),
    DenseVector(vector.toArray))
  (id, title, genres.mkString(" "), pred, dist)
}
// Group by cluster index and collect the result to the driver as a Map
val clusterAssignments = moviesAssigned.groupBy { case (id, title, genres, cluster, dist) =>
  cluster
}.collectAsMap
After running the preceding code snippet, we have a map, collected to the driver by
collectAsMap, that contains one key-value pair for each cluster; here, the key is the
numeric cluster identifier, and the value is the collection of movies assigned to that
cluster. For each movie, we have the movie ID, title, genres, cluster index, and distance
of the movie's factor vector from the cluster center.
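To make the shape of this grouped result concrete, here is a small sketch that uses ordinary Scala collections in place of an RDD. The sample rows, titles, and distances are made up for illustration, and the object and helper names are hypothetical; only the tuple layout (id, title, genres, cluster, dist) follows the code above.

```scala
// Sketch of the grouped structure using local collections instead of Spark.
object ClusterGroupingSketch {
  // (id, title, genres, cluster, dist), the same tuple shape as moviesAssigned.
  type Assignment = (Int, String, String, Int, Double)

  // Hypothetical sample rows; the values are invented for illustration only.
  val sample: Seq[Assignment] = Seq(
    (1, "Movie A", "Drama", 0, 0.9),
    (2, "Movie B", "Comedy", 1, 0.4),
    (3, "Movie C", "Drama Romance", 0, 0.2),
    (4, "Movie D", "Action", 1, 0.7)
  )

  // Group by cluster index: each key maps to the rows assigned to that cluster.
  val byCluster: Map[Int, Seq[Assignment]] = sample.groupBy(_._4)

  // Titles of the n movies closest to the center of the given cluster.
  def closestTitles(cluster: Int, n: Int): Seq[String] =
    byCluster.getOrElse(cluster, Seq.empty).sortBy(_._5).take(n).map(_._2)
}
```

With this sample data, byCluster has the keys 0 and 1, and closestTitles(0, 1) yields the single title "Movie C", because its distance of 0.2 is the smallest in cluster 0.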
Finally, we will iterate through each cluster and output the top 20 movies, ranked by
distance to the cluster center, closest first:
for ((k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
  println(s"Cluster $k:")
  val m = v.toSeq.sortBy(_._5)
  println(m.take(20).map { case (_, title, genres, _, d) =>
    (title, genres, d) }.mkString("\n"))