Building a Clustering Model with Spark - Machine Learning with Spark

Interpreting cluster predictions on the MovieLens dataset

We have covered how to make predictions for a set of input vectors, but how do we evaluate how good the predictions are? We will cover performance metrics a little later; here, we will see how to manually inspect and interpret the cluster assignments made by our K-means model.

While unsupervised techniques have the advantage that they do not require us to provide labeled data for training, the disadvantage is that the results often need to be interpreted manually. We would usually like to examine the clusters that are found and, where possible, assign some sort of label or category to them.

For example, we can examine the clustering of movies we have found to see whether each cluster has some meaningful interpretation, such as a common genre or theme among its movies. There are many approaches we can use, but we will start by taking a few movies in each cluster that are closest to the center of the cluster. We assume that these movies are the least likely to be marginal in terms of their cluster assignment, so they should be among the most representative of the movies in the cluster. By examining these sets of movies, we can see what attributes are shared by the movies in each cluster.
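The selection step described above can be sketched in plain Scala collections. This is only an illustration under stated assumptions: the `Assignment` record, the `closestPerCluster` helper, and the sample values are hypothetical and not part of the text, and the distance to the cluster center is assumed to have been computed already.

```scala
// Hypothetical record: a movie title, its assigned cluster, and its
// distance to that cluster's center (assumed to be precomputed).
case class Assignment(title: String, cluster: Int, dist: Double)

// For each cluster, keep the n movies with the smallest distance to the center.
def closestPerCluster(assignments: Seq[Assignment], n: Int): Map[Int, Seq[Assignment]] =
  assignments
    .groupBy(_.cluster)
    .map { case (k, members) => k -> members.sortBy(_.dist).take(n) }

// Toy data, invented purely to exercise the helper.
val sample = Seq(
  Assignment("A", 0, 0.9),
  Assignment("B", 0, 0.2),
  Assignment("C", 1, 0.5),
  Assignment("D", 0, 0.4)
)

val top = closestPerCluster(sample, 2)
top.toSeq.sortBy(_._1).foreach { case (k, movies) =>
  println(s"Cluster $k: ${movies.map(_.title).mkString(", ")}")
}
```

On a real dataset the same group, sort, and take pattern would be applied to the collected cluster assignments rather than to a hand-written sequence.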

Interpreting the movie clusters

To begin, we need to decide what we mean by "closest to the center of each cluster". The objective function that K-means minimizes is the sum of squared Euclidean distances between each point and its cluster center, summed over all clusters. It is therefore natural to use the squared Euclidean distance as our measure.
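Written out, the K-means objective (often called the within-cluster sum of squares) takes the following form. The notation is our own and is not taken from the text: K is the number of clusters, C_k is the set of points assigned to cluster k, and \mu_k is the center of that cluster.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

Each term is a squared Euclidean distance, which is the quantity a per-point distance measure needs to compute.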

We will define this function here. Note that we will need access to certain imports from the Breeze library (a dependency of MLlib) for linear algebra and vector-based numerical functions:

import breeze.linalg._
import breeze.numerics.pow

// Squared Euclidean distance between two vectors
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]): Double = pow(v1 - v2, 2).sum
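Note that computeDistance sums the squared differences without taking a square root, so it returns the squared distance. For ranking points by closeness this makes no difference, because the square root is monotonic. The sketch below illustrates this with a dependency-free stand-in that uses Array[Double] instead of Breeze vectors; the `squaredDistance` helper and the sample points are hypothetical and exist only for this example.

```scala
// Plain-Scala stand-in for computeDistance (assumption: Array[Double]
// replaces Breeze's DenseVector so the example needs no extra libraries).
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}

// Toy center and points, invented for illustration.
val center = Array(0.0, 0.0)
val points = Seq(
  "p1" -> Array(3.0, 4.0),
  "p2" -> Array(1.0, 1.0),
  "p3" -> Array(0.0, 2.0)
)

// Rank by squared distance and by true Euclidean distance.
val bySquared = points.sortBy { case (_, v) => squaredDistance(v, center) }.map(_._1)
val byEuclid  = points.sortBy { case (_, v) => math.sqrt(squaredDistance(v, center)) }.map(_._1)

println(bySquared.mkString(", "))
println(bySquared == byEuclid)
```

Both orderings agree, so skipping the square root saves a little work without changing which movies count as closest to a center.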
