Clustering with K-Means
As clustering algorithms go (or as any algorithms go, for that matter), K-Means is simple.
To group the input data into K clusters, we initially pick K random points in the data's domain
space and then follow these steps:
1. Assign each of the data points to the nearest cluster.
2. Move the cluster's position to the centroid (or mean position) of the data points assigned to that cluster.
3. Repeat.
We keep following these steps until either we've performed the maximum number of iterations
or the clusters are stable, that is, until the set of data points assigned to each cluster stops
changing. We saw this in the example in this recipe: the maximum number of iterations was set
to 100, but the clusters stabilized after eight iterations.
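To make these steps concrete, here is a minimal sketch of the algorithm in Python. This is not the recipe's own code; the names (kmeans, dist2) are mine, and it assumes each data point is a tuple of numbers:

import random

def dist2(p, q):
    """Squared Euclidean distance (the square root isn't needed
    when we only compare distances)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100):
    """Group points (tuples of numbers) into k clusters."""
    # Initially, pick k random points as the cluster positions.
    centroids = random.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 1: assign each data point to the nearest cluster.
        new_assignments = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:
            break  # stable: no point changed cluster, so stop early
        assignments = new_assignments
        # Step 2: move each cluster to the centroid of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster where it is
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, assignments

For instance, kmeans([(1.0, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3)], k=2) settles after a couple of iterations into one cluster near (1.1, 0.9) and one near (7.95, 8.2).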
K-Means clustering does have a number of quirks to be aware of. First, it can only be used
with numeric variables. After all, what would the distance be between two species in the
Iris dataset? What's the distance between Virginica and Setosa?
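To see why, suppose we naively encoded the species as numbers (a hypothetical encoding of my own, not something the recipe does):

species = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}
# Under this encoding, Virginica is "twice as far" from Setosa as
# Versicolor is -- an ordering and a magnitude the categories simply
# don't have, so any distances K-Means computed would be meaningless.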
Another factor is that it won't work well if the natural classifications within the data
(for example, the species in the Iris dataset) don't fall into separate, roughly circular
groups. If the data points for each class tend to run into each other, then K-Means won't
be able to reliably distinguish between the classifications.
Analyzing the results
The following graph shows the petal dimensions of the items in the Iris dataset and distinguishes
each point by species (shape) and classification (color). Generally, the results are good, but
I've highlighted half a dozen points that the algorithm put into the wrong category (some green
crosses or yellow diamonds):
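We can't reproduce the plot here, but a quick cross-tabulation of species against cluster assignment makes the same comparison. This is a sketch of my own using scikit-learn, not the recipe's code:

from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
petals = iris.data[:, 2:4]  # columns 2 and 3 are petal length and width
labels = KMeans(n_clusters=3, max_iter=100, n_init=10,
                random_state=0).fit_predict(petals)

# Cross-tabulate species against cluster assignment.
counts = Counter(zip(iris.target_names[iris.target], labels))
for (species, cluster), n in sorted(counts.items()):
    print(f"{species:<12} cluster {cluster}: {n} points")

K-Means numbers its clusters arbitrarily, so the cluster indices won't match the species indices; the point is that each species should be dominated by a single cluster, with the handful of off-cluster rows corresponding to points like those highlighted in the graph.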
 