k-Means Clustering - Data Mining for the Masses

MODELING

The 'k' in k-means clustering stands for some number of groups, or clusters. The aim of this data mining methodology is to look at each observation's individual attribute values and compare them to the means, or in other words averages, of potential groups of other observations in order to find natural groups of observations that are similar to one another. The k-means algorithm accomplishes this by sampling some set of observations in the data set, calculating the averages, or means, for each attribute for the observations in that sample, and then comparing the other observations in the data set to that sample's means. The system does this repetitively in order to 'circle in' on the best matches and then to formulate groups of observations which become the clusters. As the means calculated become more and more similar from one pass to the next, clusters are formed, and each observation whose attribute values are most like the means of a cluster becomes a member of that cluster. Because the process is iterative, k-means clustering models can sometimes take a long time to run, especially if you indicate a large number of “max runs” through the data, or if you seek a large number of clusters (k).

To build your k-means cluster model, complete the following steps:

1) Return to design view in RapidMiner if you have not done so already. In the operators search box, type k-means (be sure to include the hyphen). There are three operators that conduct k-means clustering work in RapidMiner. For this exercise, we will choose the first, which is simply named “k-Means”. Drag this operator into your stream, as shown in Figure 6-3.

Figure 6-3. Adding the k-Means operator to our model.
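
For readers who would like to see the idea behind the operator outside of RapidMiner, the repeated compare-and-average process described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the general k-means procedure, not RapidMiner's implementation: the function name k_means, the parameters max_iterations and seed, and the small data set are all made up for this example.

```python
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Illustrative k-means: group `points` (lists of numbers) into k clusters.

    All names here are hypothetical; this is not RapidMiner's code.
    """
    rng = random.Random(seed)
    # Start by sampling k observations and using them as the initial means.
    means = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)

    for _ in range(max_iterations):
        # Compare every observation to each cluster's means and pick the
        # closest one (smallest squared distance across all attributes).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        # Stop once no observation changes cluster between passes.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Recalculate the mean of each attribute for every cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old means if a cluster ends up empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]

    return means, assignments

# Made-up observations with two attributes each.
observations = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
cluster_means, cluster_ids = k_means(observations, k=2)
print(cluster_means)
print(cluster_ids)
```

Each pass through the loop does more work as the number of observations, attributes, or clusters grows, which is one reason a large k can lengthen run times.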
