Building a Clustering Model with Spark - Machine Learning with Spark

Tip

The preceding pow function is a Breeze universal function. This function is the same as the pow function from scala.math, except that it operates element-wise on the vector that is returned from the minus operation between the two input vectors.
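As a rough illustration of what "element-wise" means here, the following sketch reproduces the three steps (minus, pow, sum) on plain Scala arrays, with no Breeze dependency. It assumes that computeDistance sums the squared differences, since its definition is not shown in this section; the object and method names are hypothetical.

```scala
// Plain-Scala sketch of a squared Euclidean distance; names are illustrative
// and this is an assumption about what computeDistance does, not its source.
object DistanceSketch {
  def squaredDistance(v1: Array[Double], v2: Array[Double]): Double = {
    require(v1.length == v2.length, "vectors must have the same length")
    // Step 1: element-wise minus between the two input vectors
    val diff = v1.zip(v2).map { case (a, b) => a - b }
    // Step 2: element-wise pow, applied to every entry of the difference
    val squared = diff.map(d => math.pow(d, 2))
    // Step 3: sum the squared differences into a single distance value
    squared.sum
  }
}
```

For example, DistanceSketch.squaredDistance(Array(1.0, 2.0), Array(4.0, 6.0)) evaluates to 25.0, since the differences are -3 and -4.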

Now, we will use this function to compute, for each movie, the distance of the relevant movie factor vector from the center vector of the assigned cluster. We will also join our cluster assignments and distances data with the movie titles and genres so that we can output the results in a more readable way:

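// Join (id, (title, genres)) with (id, factor vector) on the movie ID
val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map { case (id, ((title, genres), vector)) =>
  // Index of the cluster this movie is assigned to
  val pred = movieClusterModel.predict(vector)
  val clusterCentre = movieClusterModel.clusterCenters(pred)
  // Distance between the movie's factor vector and its cluster center
  val dist = computeDistance(DenseVector(clusterCentre.toArray), DenseVector(vector.toArray))
  (id, title, genres.mkString(" "), pred, dist)
}
// Group the movies by cluster index and collect the result to the driver
val clusterAssignments = moviesAssigned.groupBy { case (id, title, genres, cluster, dist) =>
  cluster
}.collectAsMap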

After running the preceding code snippet, we have a Map, collected to the driver, that contains one key-value pair for each cluster; here, the key is the numeric cluster identifier, and the value is the collection of movies assigned to that cluster, together with related information. The movie information we have is the movie ID, title, genres, cluster index, and distance of the movie's factor vector from the cluster center.
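To see the shape of the grouped result without a Spark cluster, the following minimal sketch applies the same groupBy pattern to a plain Scala Seq. The sample rows and the names GroupingSketch, Assigned, sample, and byCluster are invented for illustration; on an RDD, the grouped values are an Iterable rather than a Seq.

```scala
// Local-collection sketch of grouping by cluster index; all data is made up.
object GroupingSketch {
  // Same tuple layout as moviesAssigned: (id, title, genres, cluster, dist)
  type Assigned = (Int, String, String, Int, Double)

  // Hypothetical sample rows, not taken from the MovieLens data
  val sample: Seq[Assigned] = Seq(
    (1, "Movie A", "Drama", 0, 0.8),
    (2, "Movie B", "Comedy", 1, 0.3),
    (3, "Movie C", "Drama Romance", 0, 0.2)
  )

  // Key: cluster index; value: the full tuples assigned to that cluster
  val byCluster: Map[Int, Seq[Assigned]] =
    sample.groupBy { case (_, _, _, cluster, _) => cluster }
}
```

Here GroupingSketch.byCluster(0) holds the two rows for cluster 0, each still carrying its own ID, title, genres, cluster index, and distance.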

Finally, we will iterate through each cluster and output the top 20 movies, ranked by distance to the cluster center, closest first:

for ((k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
  println(s"Cluster $k:")
  // Sort the cluster's movies by distance (the fifth tuple field), closest first
  val m = v.toSeq.sortBy(_._5)
  println(m.take(20).map { case (_, title, genres, _, d) =>
    (title, genres, d) }.mkString("\n"))
}
