Advanced Analytical Theory and Methods: Clustering - Data Science and Big Data Analytics


[9,] 98 96 96

[10,] 99 99 95

To determine an appropriate value for k, the k-means algorithm is used to identify

clusters for k = 1, 2, …, 15. For each value of k, the WSS is calculated. If an

additional cluster provides a better partitioning of the data points, the WSS should

be markedly smaller than without the additional cluster.
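
To make the quantity concrete, the sketch below computes the WSS by hand as the sum of squared Euclidean distances between each point and its assigned centroid, then compares it with the values that kmeans() reports. The toy matrix and the names assigned_centroids and wss_manual are assumptions made for illustration; toy is synthetic and only stands in for the grade data.

```r
# Hypothetical stand-in for the grade data: 60 synthetic rows, 3 score columns.
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 60, sd = 4), ncol = 3),
  matrix(rnorm(60, mean = 80, sd = 4), ncol = 3),
  matrix(rnorm(60, mean = 95, sd = 2), ncol = 3)
)

fit <- kmeans(toy, centers = 3, nstart = 25)

# Look up the centroid assigned to each row, then sum the squared distances.
assigned_centroids <- fit$centers[fit$cluster, , drop = FALSE]
wss_manual <- sum((toy - assigned_centroids)^2)

# kmeans() reports the same quantity per cluster (withinss)
# and summed over all clusters (tot.withinss).
print(all.equal(wss_manual, sum(fit$withinss)))
print(all.equal(wss_manual, fit$tot.withinss))
```

Because tot.withinss is already the sum over clusters, it can be used in place of sum(...$withinss) when only the total is needed.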

The following R code runs a k-means analysis for each number of centroids, k, from 1 to 15. For each k, the option nstart=25 specifies that the k-means algorithm will be repeated 25 times, each starting with k random initial centroids; kmeans() keeps the run with the smallest WSS. The corresponding value of WSS for each k-means analysis is stored in the wss vector.

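# wss[k] holds the total within sum of squares for the k-cluster solution
wss <- numeric(15)

# sum the per-cluster values in $withinss to get the total WSS for each k
for (k in 1:15) wss[k] <- sum(kmeans(kmdata, centers=k,
                                     nstart=25)$withinss)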

Using the basic R plot function, each WSS is plotted against the respective number of centroids, 1 through 15. This plot is provided in Figure 4.5.

plot(1:15, wss, type="b", xlab="Number of Clusters",

ylab="Within Sum of Squares")

Figure 4.5 WSS of the student grade data

As can be seen, the WSS is greatly reduced when k increases from one to two.

Another substantial reduction in WSS occurs at k = 3. However, the improvement

in WSS is fairly linear for k > 3. Therefore, the k-means analysis will be conducted

for k = 3. The process of identifying the appropriate value of k is referred to as

finding the “elbow” of the WSS curve.
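
The elbow is usually judged by eye, but the size of each successive reduction can also be tabulated. The following is a minimal sketch under stated assumptions rather than part of the original analysis: toy is synthetic data with three well-separated groups, and reduction and rel_reduction are hypothetical names introduced here.

```r
# Hypothetical stand-in for the grade data: three well-separated synthetic groups.
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 60, sd = 4), ncol = 3),
  matrix(rnorm(60, mean = 80, sd = 4), ncol = 3),
  matrix(rnorm(60, mean = 95, sd = 2), ncol = 3)
)

# Total WSS for k = 1, ..., 15, as in the loop above.
wss <- numeric(15)
for (k in 1:15) wss[k] <- kmeans(toy, centers = k, nstart = 25)$tot.withinss

# reduction[k] is the WSS removed by going from k to k + 1 clusters.
reduction <- -diff(wss)

# The same reduction expressed as a fraction of the WSS at k.
rel_reduction <- reduction / wss[-length(wss)]

print(data.frame(from_k = 1:14, to_k = 2:15,
                 reduction = round(reduction, 1),
                 rel_reduction = round(rel_reduction, 2)))
```

Large early reductions followed by small, roughly constant ones correspond to the bend in the curve. The table supports the visual judgment of the plot; it does not replace it.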
