[9,] 98 96 96
[10,] 99 99 95
To determine an appropriate value for k, the k-means algorithm is used to identify
clusters for k = 1, 2, …, 15. For each value of k, the WSS is calculated. If an
additional cluster provides a better partitioning of the data points, the WSS should
be markedly smaller than without the additional cluster.
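As a rough illustration of what the WSS measures, the sketch below computes it by hand as the sum of squared Euclidean distances between each point and the centroid of its assigned cluster, then prints it next to the value reported by kmeans(). The matrix toy and its column names are hypothetical stand-ins for the grade data, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the grade data: 60 students, 3 subjects
toy <- matrix(round(runif(180, min = 60, max = 100)), ncol = 3,
              dimnames = list(NULL, c("English", "Math", "Science")))

fit <- kmeans(toy, centers = 3, nstart = 25)

# Look up the centroid assigned to each row, then sum the squared differences
assigned <- fit$centers[fit$cluster, , drop = FALSE]
wss_manual <- sum((toy - assigned)^2)

print(c(manual = wss_manual, kmeans = fit$tot.withinss))
```

The two printed values should agree up to floating-point rounding, which is a quick way to check that a hand-written WSS matches the one kmeans() reports.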
The following R code runs a separate k-means analysis for each number of
centroids, k, from 1 to 15. For each k, the option nstart=25 specifies
that the k-means algorithm will be repeated 25 times, each starting with k random
initial centroids, and the run with the smallest WSS is kept. The corresponding
value of WSS for each k-means analysis is stored in the wss vector.
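# wss[k] holds the total within sum of squares for k clusters
wss <- numeric(15)
# For each k, keep the best of 25 random starts and sum the per-cluster WSS
for (k in 1:15) wss[k] <- sum(kmeans(kmdata, centers=k,
                                     nstart=25)$withinss)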
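The object returned by kmeans() also carries a tot.withinss component, which is the sum of the per-cluster withinss values, so the same vector can be built without the explicit sum(). The sketch below is one possible variant, not the book's code; toy is a hypothetical stand-in for kmdata and wss_alt is an illustrative name.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for kmdata: 100 rows, 3 numeric columns
toy <- matrix(runif(300, min = 60, max = 100), ncol = 3)

# One k-means fit per k, reading the total WSS directly from the result
wss_alt <- sapply(1:15, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Check on a single fit that the two ways of getting the total agree
fit <- kmeans(toy, centers = 4, nstart = 25)
print(all.equal(sum(fit$withinss), fit$tot.withinss))
```

Using sapply() returns the vector in one step, so it does not need to be preallocated with numeric(15).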
Using the basic R plot function, each WSS is plotted against the respective number
of centroids, 1 through 15. This plot is provided in Figure 4.5.
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within Sum of Squares")
Figure 4.5 WSS of the student grade data
As can be seen, the WSS is greatly reduced when k increases from one to two.
Another substantial reduction in WSS occurs at k = 3. For k > 3, however, the
WSS decreases only gradually and in a fairly linear fashion, so each additional
cluster adds little. Therefore, the k-means analysis will be conducted
for k = 3. The process of identifying the appropriate value of k is referred to as
finding the “elbow” of the WSS curve.
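The elbow is usually read off the plot by eye, but the same information can be tabulated. The sketch below is only an illustration under stated assumptions: toy is synthetic data built with three well-separated groups (not the student grade data), and reduction is an illustrative name for the drop in WSS gained by each additional cluster.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: three well-separated groups of 30 points in 3 dimensions
toy <- rbind(matrix(rnorm(90, mean = 60, sd = 3), ncol = 3),
             matrix(rnorm(90, mean = 80, sd = 3), ncol = 3),
             matrix(rnorm(90, mean = 95, sd = 3), ncol = 3))

wss <- numeric(15)
for (k in 1:15) wss[k] <- kmeans(toy, centers = k, nstart = 25)$tot.withinss

# Reduction in WSS gained by moving from k - 1 to k clusters (k = 2, ..., 15)
reduction <- -diff(wss)
names(reduction) <- 2:15

# Share of the k = 1 WSS removed by each additional cluster
print(round(reduction / wss[1], 3))
```

On data like this, the shares for k = 2 and k = 3 would be expected to dominate and the remaining ones to be small, which corresponds to the flattening of the curve past the elbow. The table is a reading aid rather than an automatic rule for choosing k.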