Clustering with K-Means
As clustering algorithms go (or as any algorithms go, for that matter), K-Means is simple.
To group the input data into K clusters, we initially pick K random points in the data's domain
space and then follow these steps:
1. Assign each of the data points to the nearest cluster.
2. Move the cluster's position to the centroid (or mean position) of the data points assigned to that cluster.
3. Repeat.
We keep following these steps until either we've performed the maximum number of iterations
or the clusters are stable, that is, until the set of data points assigned to each cluster stops
changing. We saw this in the example in this recipe: the maximum number of iterations was set
to 100, but the clusters stabilized after eight iterations.
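To make these steps concrete, here is a minimal sketch of the algorithm in Python. This is not the recipe's own code; the names (kmeans, dist2) are mine, and it assumes each data point is a tuple of numbers:

import random

def dist2(p, q):
    """Squared Euclidean distance (the square root isn't needed
    when we only compare distances)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100):
    """Group points (tuples of numbers) into k clusters."""
    # Initially, pick k random points as the cluster positions.
    centroids = random.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 1: assign each data point to the nearest cluster.
        new_assignments = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:
            break  # stable: no point changed cluster, so stop early
        assignments = new_assignments
        # Step 2: move each cluster to the centroid of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster where it is
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, assignments

For instance, kmeans([(1.0, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3)], k=2) settles after a couple of iterations into one cluster near (1.1, 0.9) and one near (7.95, 8.2).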
K-Means clustering does have a number of quirks to be aware of. First, it can only be used
with numeric variables. After all, what would the distance be between two species in the
Iris dataset? What's the distance between Virginica and Setosa?
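To see why, suppose we naively encoded the species as numbers (a hypothetical encoding of my own, not something the recipe does):

species = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}
# Under this encoding, Virginica is "twice as far" from Setosa as
# Versicolor is -- an ordering and a magnitude the categories simply
# don't have, so any distances K-Means computed would be meaningless.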
Another factor is that it won't work well if the natural classifications within the data
(for example, the species in the Iris dataset) don't fall into separate, roughly circular
groups. If the data points for each class tend to run into each other, then K-Means won't
be able to reliably distinguish between the classifications.
Analyzing the results
The following graph shows the petal dimensions of the items in the Iris dataset and distinguishes
each point by species (shape) and classification (color). Generally, the results are good, but
I've highlighted half a dozen points that the algorithm put into the wrong category (some green
crosses or yellow diamonds):
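We can't reproduce the plot here, but a quick cross-tabulation of species against cluster assignment makes the same comparison. This is a sketch of my own using scikit-learn, not the recipe's code:

from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
petals = iris.data[:, 2:4]  # columns 2 and 3 are petal length and width
labels = KMeans(n_clusters=3, max_iter=100, n_init=10,
                random_state=0).fit_predict(petals)

# Cross-tabulate species against cluster assignment.
counts = Counter(zip(iris.target_names[iris.target], labels))
for (species, cluster), n in sorted(counts.items()):
    print(f"{species:<12} cluster {cluster}: {n} points")

K-Means numbers its clusters arbitrarily, so the cluster indices won't match the species indices; the point is that each species should be dominated by a single cluster, with the handful of off-cluster rows corresponding to points like those highlighted in the graph.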
 