Interpreting cluster predictions on the MovieLens dataset
We have covered how to make predictions for a set of input vectors, but how do we
evaluate how good the predictions are? We will cover performance metrics a little later;
here, we will see how to manually inspect and interpret the cluster assignments
made by our K-means model.
While unsupervised techniques have the advantage that they do not require us to provide
labeled data for training, the disadvantage is that the results often need to be manually
interpreted. We would usually like to examine the clusters that are found more closely
and, where possible, assign some sort of label or category to them.
For example, we can examine the clustering of movies we have found to try to see whether
there is some meaningful interpretation of each cluster, such as a common genre or theme
among the movies in the cluster. There are many approaches we can use, but we will start
by taking a few movies in each cluster that are closest to the center of the cluster. These
movies, we assume, would be the ones that are least likely to be marginal in terms of their
cluster assignment, and so, they should be among the most representative of the movies in
the cluster. By examining these sets of movies, we can see what attributes are shared by the
movies in each cluster.
Interpreting the movie clusters
To begin, we need to decide what we mean by "closest to the center of each cluster". The
objective function that is minimized by K-means is the sum of squared Euclidean distances
between each point and its assigned cluster center, summed over all clusters. Therefore, it
is natural to use the squared Euclidean distance as our measure. Ranking points by squared
distance gives the same ordering as ranking them by distance, so no square root is needed.
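For reference, the objective can be written out as follows. The symbols are notation introduced here for illustration and do not appear in the original text: K is the number of clusters, C_k is the set of points assigned to cluster k, and \mu_k is the center of cluster k.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

The inner term, the squared Euclidean distance between a point and its cluster center, is the per-point quantity we will compute for each movie.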
We will define this function here. Note that we will need access to certain imports from the
Breeze library (a dependency of MLlib) for linear algebra and vector-based numerical
functions:
import breeze.linalg._
import breeze.numerics.pow
// Squared Euclidean distance: sum of the squared element-wise differences
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]) = pow(v1 - v2, 2).sum
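To make the idea concrete, here is a minimal, dependency-free sketch of how the movies nearest each cluster center might be selected. It is an illustration under stated assumptions, not the pipeline used in this chapter: it uses plain Array[Double] instead of Breeze vectors so that it runs without Spark, and the names MovieAssignment, squaredDistance, and closestToCenters, as well as the toy data, are hypothetical.

```scala
// Hypothetical record: a movie title, its assigned cluster, and its feature vector.
case class MovieAssignment(title: String, cluster: Int, features: Array[Double])

// Squared Euclidean distance; a plain-array analogue of computeDistance.
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}

// For each cluster, keep the n movies closest to that cluster's center,
// paired with their squared distance and sorted from nearest to farthest.
def closestToCenters(
    movies: Seq[MovieAssignment],
    centers: IndexedSeq[Array[Double]],
    n: Int): Map[Int, Seq[(String, Double)]] =
  movies
    .groupBy(_.cluster)
    .map { case (cluster, members) =>
      val ranked = members
        .map(m => (m.title, squaredDistance(m.features, centers(cluster))))
        .sortBy(_._2)
        .take(n)
      cluster -> ranked
    }

// Toy data, invented purely for illustration.
val centers = IndexedSeq(Array(0.0, 0.0), Array(10.0, 10.0))
val movies = Seq(
  MovieAssignment("A", 0, Array(1.0, 0.0)),
  MovieAssignment("B", 0, Array(3.0, 4.0)),
  MovieAssignment("C", 1, Array(10.0, 9.0)),
  MovieAssignment("D", 1, Array(7.0, 6.0))
)
val top = closestToCenters(movies, centers, 1)
```

With real data, the titles returned for each cluster are the ones we would inspect for a shared genre or theme.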