k-Means Clustering - Data Mining for the Masses

target in the hopes of raising awareness, educating policy holders, and modifying behaviors that will lead to lower incidence of heart disease among her employer's clients.

CHAPTER SUMMARY

k-Means clustering is a data mining model that falls primarily on the side of Classification when referring to the Venn diagram from Chapter 1 (Figure 1-2). In this chapter's example, it does not necessarily predict which insurance policy holders will or will not develop heart disease. It simply takes known indicators from the attributes in a data set and groups observations together based on how similar their attribute values are to group averages. Because any attribute that can be quantified can also have a mean calculated, k-means clustering provides an effective way of grouping observations based on what is typical or normal for each group. It also helps us understand where one group ends and the next begins, or in other words, where the natural breaks occur between groups in a data set.
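
The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not RapidMiner's implementation: the function names, the starting centroids, and the (weight, cholesterol) values are all made up for the example, and Euclidean distance is assumed as the similarity measure.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean_point(points):
    """Return the attribute-by-attribute mean of a non-empty list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and recomputing means until stable."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each observation joins the group whose mean is nearest.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each group's mean is recomputed from its current members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:
            break  # the means stopped moving, so the groups are settled
        centroids = new_centroids
    return labels, centroids

# Hypothetical (weight, cholesterol) observations, invented for illustration.
observations = [(110, 150), (120, 160), (115, 155),
                (190, 230), (200, 240), (195, 235)]

labels, centroids = k_means(
    observations,
    initial_centroids=[observations[0], observations[3]],
)
print(labels)
print(centroids)
```

Each centroid is simply the mean of the observations assigned to it, which is why the "natural break" between the two groups falls where an observation becomes closer to one group's average than to the other's.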

k-Means clustering is very flexible in its ability to group observations together. The k-Means operator in RapidMiner allows data miners to set the number of clusters they wish to generate, to dictate the number of sample means used to determine the clusters, and to choose among a number of different algorithms for evaluating means. While fairly simple in its setup and definition, k-Means clustering is a powerful method for finding natural groups of observations in a data set.
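
To make these choices concrete, the sketch below exposes three knobs in plain Python: the number of clusters, the number of randomly seeded runs, and the distance measure. The names `k`, `runs`, and `distance` are hypothetical and are not RapidMiner's parameter names; mapping "the number of sample means" to repeated random starts is an assumption, and the data values are invented. Using a non-Euclidean measure while still averaging members is a simplification kept only to show that the measure can be swapped.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute attribute differences between two observations."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster_once(points, k, distance, rng, max_steps=100):
    """One k-means run that starts from k randomly chosen observations."""
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_steps):
        # Assign every observation to its nearest current mean.
        labels = [min(range(k), key=lambda i: distance(p, centroids[i]))
                  for p in points]
        # Recompute each mean from its members; keep the old one if empty.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[i])
        if updated == centroids:
            break
        centroids = updated
    # Total distance from each observation to its assigned mean (lower is tighter).
    total = sum(distance(p, centroids[label]) for p, label in zip(points, labels))
    return total, labels, centroids

def best_of_runs(points, k, runs=10, distance=euclidean, seed=0):
    """Repeat the clustering from different random starts; keep the tightest result."""
    rng = random.Random(seed)
    results = [cluster_once(points, k, distance, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[0])

# Hypothetical (weight, cholesterol) observations, invented for illustration.
observations = [(110, 150), (120, 160), (115, 155),
                (190, 230), (200, 240), (195, 235)]

total, labels, centroids = best_of_runs(observations, k=2, runs=5)
print(labels, centroids)
```

Repeating the run from several random starts guards against an unlucky initial choice of means, at the cost of doing the clustering several times.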

REVIEW QUESTIONS

1) What does the k in k-Means clustering stand for?

2) How are clusters identified? What process does RapidMiner use to define clusters and place observations in a given cluster?

3) What does the Centroid Table tell the data miner? How do you interpret the values in a Centroid Table?

4) How do descriptive statistics aid in the process of evaluating and deploying a k-Means clustering model?
