k-Means Clustering - Data Mining for the Masses


participate in the programs she will offer. She also understands that there are probably policy

holders with high weight and low cholesterol, those with high weight and high cholesterol, and

those with low weight and high cholesterol. She further recognizes there are likely to be a lot of

people somewhere in between. In order to accomplish her goal, she needs to search among the

thousands of policy holders to find groups of people with similar characteristics and craft

programs and communications that will be relevant and appealing to people in these different

groups.

DATA UNDERSTANDING

Using the insurance company's claims database, Sonia extracts three attributes for 547 randomly

selected individuals. The three attributes are the insured's weight in pounds as recorded on the

person's most recent medical examination, their last cholesterol level determined by blood work in

their doctor's lab, and their gender. As is typical in many data sets, the gender attribute uses 0 to

indicate Female and 1 to indicate Male. We will use this sample data from Sonia's employer's

database to build a cluster model to help Sonia understand how her company's clients, the health

insurance policy holders, appear to group together on the basis of their weights, genders and

cholesterol levels. We should remember as we do this that means are particularly susceptible to

undue influence by extreme outliers, so watching for inconsistent data when using the k-Means

clustering data mining methodology is very important.
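The point about outliers can be checked with a little arithmetic. The sketch below is only an illustration: the weights are made-up values, not rows from Sonia's data set, and the data-entry slip (165 typed as 1650) is a hypothetical example of inconsistent data. It shows how a single bad value drags the mean far from the typical case while the median barely moves.

```python
from statistics import mean, median

# Hypothetical weights in pounds; illustrative values, not taken from the data set.
weights = [120, 135, 150, 165, 180]

# Assume one record was mistyped with an extra zero (165 entered as 1650).
with_outlier = weights + [1650]

clean_mean = mean(weights)            # average of the consistent values
skewed_mean = mean(with_outlier)      # average once the bad value is included
skewed_median = median(with_outlier)  # the median is far less affected

print(clean_mean, skewed_mean, skewed_median)
```

Because k-Means places each cluster center at the mean of its members, a value like this could pull a center away from the group it is supposed to represent.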

DATA PREPARATION

As with previous chapters, a data set has been prepared for this chapter's example, and is available

as Chapter06DataSet.csv on the book's companion web site. If you would like to follow along

with this example exercise, go ahead and download the data set now, and import it into your

RapidMiner data repository. At this point you are probably getting comfortable with importing

CSV data sets into a RapidMiner repository, but remember that the steps are outlined in Chapter 3

if you need to review them. Be sure to designate the attribute names correctly and to check your

data types as you import. Once you have imported the data set, drag it into a new, blank process

window so that you can begin to set up your k-Means clustering data mining model. Your process

should look like Figure 6-1.
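For readers who want to double-check attribute names and data types outside of RapidMiner, the following is a minimal sketch in Python. It is not part of the chapter's RapidMiner process. The column names (Weight, Cholesterol, Gender) and the sample rows are assumptions made for illustration, and the helper functions load_rows and find_problems are hypothetical names.

```python
import csv
import io

# Hypothetical sample rows. The column names are assumed, not confirmed by the text.
SAMPLE = """Weight,Cholesterol,Gender
102,111,1
115,135,1
184,220,0
"""

def load_rows(handle):
    """Parse CSV rows, converting every attribute to an integer."""
    reader = csv.DictReader(handle)
    return [{name: int(value) for name, value in row.items()} for row in reader]

def find_problems(rows):
    """Return (row_number, message) pairs for values that look inconsistent."""
    problems = []
    for number, row in enumerate(rows, start=1):
        if row["Gender"] not in (0, 1):
            problems.append((number, "Gender must be 0 (Female) or 1 (Male)"))
        if row["Weight"] <= 0 or row["Cholesterol"] <= 0:
            problems.append((number, "Weight and Cholesterol must be positive"))
    return problems

rows = load_rows(io.StringIO(SAMPLE))
problems = find_problems(rows)
print(len(rows), "rows checked;", len(problems), "problems found")
```

To run the same checks against the downloaded file, and assuming its columns carry these names, replace io.StringIO(SAMPLE) with open("Chapter06DataSet.csv", newline="").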
