Selecting K through cross-validation
As we have done with classification and regression models, we can apply cross-validation
techniques to select the optimal number of clusters for our model. This works in much the
same way as for supervised learning methods. We will split the dataset into a training set
and a test set. We will then train a model on the training set and compute the evaluation
metric of interest on the test set.
We will do this for the movie clustering using the built-in WCSS evaluation metric
provided by MLlib in the following code, using a 60 percent / 40 percent split between the
training set and test set:
val trainTestSplitMovies = movieVectors.randomSplit(Array(0.6, 0.4), 123)
val trainMovies = trainTestSplitMovies(0)
val testMovies = trainTestSplitMovies(1)
val costsMovies = Seq(2, 3, 4, 5, 10, 20).map { k =>
  (k, KMeans.train(trainMovies, numIterations, k, numRuns).computeCost(testMovies))
}
println("Movie clustering cross-validation:")
costsMovies.foreach { case (k, cost) =>
  println(f"WCSS for K=$k is $cost%2.2f")
}
This should give results that look something like the ones shown here.
The output of movie clustering cross-validation is:
Movie clustering cross-validation:
WCSS for K=2 is 942.06
WCSS for K=3 is 942.67
WCSS for K=4 is 950.35
WCSS for K=5 is 948.20
WCSS for K=10 is 943.26
WCSS for K=20 is 947.10
We can observe that the WCSS decreases as the number of clusters increases, up to a point.
It then begins to increase. Another common pattern observed in the WCSS in cross-validation for K-means is that the metric continues to decrease as K increases, but at a certain
point, the rate of decrease flattens out substantially. The value of K at which this occurs is
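The flattening pattern described above is the basis of the so-called elbow heuristic for choosing K. As a minimal sketch (not part of MLlib), the following plain Scala function takes the (K, WCSS) pairs computed above and returns the last K whose incremental reduction in WCSS is still a substantial fraction of the first reduction; the `flatRatio` threshold and the sample WCSS values are illustrative assumptions, not values from the movie dataset:

```scala
object ElbowSketch {
  // Given (K, WCSS) pairs sorted by ascending K, return the last K whose
  // drop in WCSS is still at least flatRatio of the first observed drop.
  // flatRatio is an assumed tuning knob, not an MLlib parameter.
  def elbowK(costs: Seq[(Int, Double)], flatRatio: Double = 0.1): Int = {
    // Successive reductions in WCSS: (K, previousCost - currentCost)
    val drops = costs.sliding(2).collect {
      case Seq((_, c1), (k2, c2)) => (k2, c1 - c2)
    }.toSeq
    val firstDrop = drops.head._2
    // Keep Ks while the reduction is still "substantial", take the last one
    drops.takeWhile { case (_, d) => d >= flatRatio * firstDrop }
         .lastOption.map(_._1).getOrElse(costs.head._1)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical WCSS curve that flattens after K=4
    val wcss = Seq(2 -> 1000.0, 3 -> 600.0, 4 -> 400.0, 5 -> 380.0, 10 -> 370.0)
    println(elbowK(wcss)) // prints 4
  }
}
```

In practice, inspecting a plot of WCSS against K is often more reliable than any fixed threshold, since the "elbow" can be gradual or ambiguous.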