Data Mining Techniques for Segmentation - Data Mining Techniques in CRM: Inside Customer Segmentation

DATA CONSIDERATIONS FOR CLUSTERING MODELS

Clustering models are unsupervised techniques, appropriate when there is no

target output field. They analyze a set of inputs and group records with respect to

the identified input data patterns.

The algorithms presented here (K-means, TwoStep, and Kohonen networks)

work best with continuous numeric input fields. Although they can also handle

categorical inputs, a general recommendation would be to avoid using categorical

fields for clustering. The TwoStep model uses a specific distance measure that can

more efficiently handle mixed (both categorical and continuous) inputs. K-means

and Kohonen networks integrate an internal preprocessing encoding procedure

for categorical fields. Each categorical field is recoded as a set of flag (binary/

dichotomous) indicator fields, one such field for each original category. This recoding

is called indicator coding. For each record, the indicator field corresponding

to the category of the record is set to 1 and all other indicator fields are set

to 0. Despite this special handling, categorical clustering fields tend to dominate

the formation of the clusters and usually yield biased clustering solutions which

overlook the differences attributable to the rest of the inputs.
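
The indicator coding described above can be sketched in a few lines of Python. This is only an illustration of the recoding idea, not the internal procedure of any particular tool; the function name indicator_code and the sample field rate_plan are hypothetical.

```python
def indicator_code(values):
    """Recode one categorical field as a set of 0/1 flag fields.

    Returns the sorted list of categories and, for each record, a list of
    flags with exactly one 1 (the record's own category) and 0 elsewhere.
    """
    categories = sorted(set(values))
    flags = [[1 if value == category else 0 for category in categories]
             for value in values]
    return categories, flags


# Hypothetical categorical input field with three categories.
rate_plan = ["prepaid", "postpaid", "prepaid", "business"]
categories, flags = indicator_code(rate_plan)

for value, row in zip(rate_plan, flags):
    print(value, row)
```

Two records in different categories always differ on two flag fields at once, which gives an intuition for why categorical inputs can weigh heavily in distance-based clustering.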

Often, input clustering fields are measured on different scales. Since clustering

models take into account the differences between records, differences in

measurement scales can lead to biased clustering solutions simply because some

fields might be measured in larger values. Fields measured in larger values show

increased variability. If used in their original scale, they will dominate the cluster

solution. Thus, a standardization (normalization) process is necessary in order to

put fields into comparable scales and ensure that fields with larger values do not

determine the solution. The two most common standardization methods are

the z-score and the 0-1 (or min-max) approaches. In the z-score approach, the

standardized field is created as below:

(Record value − mean value of field) / standard deviation of the field.

Resulting fields have a mean of 0 and a standard deviation of 1. The record

values denote the number of standard deviations above or below the overall mean

value.
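
A minimal Python sketch of the z-score approach follows. The text does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (statistics.pstdev). The function name z_scores and the field monthly_spend are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a numeric field: (value - mean) / standard deviation."""
    field_mean = mean(values)
    field_std = pstdev(values)  # assumption: population standard deviation
    if field_std == 0:
        raise ValueError("a constant field cannot be standardized")
    return [(value - field_mean) / field_std for value in values]


# Hypothetical continuous input field.
monthly_spend = [20.0, 40.0, 60.0, 80.0]
standardized = z_scores(monthly_spend)

print(standardized)
print(mean(standardized), pstdev(standardized))
```

Each standardized value reads as the number of standard deviations the record lies above (positive) or below (negative) the field mean.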

The 0-1 approach rescales all record values to the range 0-1, by subtracting

the minimum from each value and dividing the difference by the range of

the field:

(Record value − minimum value of field) / (maximum value of field − minimum value of field).
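
The 0-1 (min-max) rescaling can be sketched the same way. The function name min_max and the field tenure_months are hypothetical, and the guard against a constant field is an added assumption, since the formula is undefined when the range is zero.

```python
def min_max(values):
    """Rescale a numeric field to 0-1: (value - minimum) / (maximum - minimum)."""
    lowest, highest = min(values), max(values)
    field_range = highest - lowest
    if field_range == 0:
        raise ValueError("a constant field has zero range and cannot be rescaled")
    return [(value - lowest) / field_range for value in values]


# Hypothetical continuous input field.
tenure_months = [6, 12, 24, 60]
rescaled = min_max(tenure_months)

print(rescaled)
```

The smallest record value maps to 0 and the largest to 1, so a single extreme value can compress all other records into a narrow part of the range.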
