Selecting K through cross-validation
As we have done with classification and regression models, we can apply cross-validation
techniques to select the optimal number of clusters for our model. This works in much the
same way as for supervised learning methods. We will split the dataset into a training set
and a test set. We will then train a model on the training set and compute the evaluation
metric of interest on the test set.
We will do this for the movie clustering using the built-in WCSS evaluation metric
provided by MLlib in the following code, using a 60 percent / 40 percent split between the
training set and test set:
val trainTestSplitMovies = movieVectors.randomSplit(Array(0.6, 0.4), 123)
val trainMovies = trainTestSplitMovies(0)
val testMovies = trainTestSplitMovies(1)
val costsMovies = Seq(2, 3, 4, 5, 10, 20).map { k =>
  (k, KMeans.train(trainMovies, numIterations, k, numRuns).computeCost(testMovies))
}
println("Movie clustering cross-validation:")
costsMovies.foreach { case (k, cost) =>
  println(f"WCSS for K=$k is $cost%2.2f")
}
This should give results that look something like the ones shown here.
The output of movie clustering cross-validation is:
Movie clustering cross-validation:
WCSS for K=2 is 942.06
WCSS for K=3 is 942.67
WCSS for K=4 is 950.35
WCSS for K=5 is 948.20
WCSS for K=10 is 943.26
WCSS for K=20 is 947.10
We can observe that the WCSS decreases as the number of clusters increases, up to a point.
It then begins to increase. Another common pattern observed in the WCSS in cross-validation for K-means is that the metric continues to decrease as K increases, but at a certain
point, the rate of decrease flattens out substantially. The value of K at which this occurs is
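The flattening pattern described above is the basis of the so-called elbow heuristic for choosing K. As a minimal sketch (not part of MLlib), the following plain Scala function takes the (K, WCSS) pairs computed above and returns the last K whose incremental reduction in WCSS is still a substantial fraction of the first reduction; the `flatRatio` threshold and the sample WCSS values are illustrative assumptions, not values from the movie dataset:

```scala
object ElbowSketch {
  // Given (K, WCSS) pairs sorted by ascending K, return the last K whose
  // drop in WCSS is still at least flatRatio of the first observed drop.
  // flatRatio is an assumed tuning knob, not an MLlib parameter.
  def elbowK(costs: Seq[(Int, Double)], flatRatio: Double = 0.1): Int = {
    // Successive reductions in WCSS: (K, previousCost - currentCost)
    val drops = costs.sliding(2).collect {
      case Seq((_, c1), (k2, c2)) => (k2, c1 - c2)
    }.toSeq
    val firstDrop = drops.head._2
    // Keep Ks while the reduction is still "substantial", take the last one
    drops.takeWhile { case (_, d) => d >= flatRatio * firstDrop }
         .lastOption.map(_._1).getOrElse(costs.head._1)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical WCSS curve that flattens after K=4
    val wcss = Seq(2 -> 1000.0, 3 -> 600.0, 4 -> 400.0, 5 -> 380.0, 10 -> 370.0)
    println(elbowK(wcss)) // prints 4
  }
}
```

In practice, inspecting a plot of WCSS against K is often more reliable than any fixed threshold, since the "elbow" can be gradual or ambiguous.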