DATA CONSIDERATIONS FOR CLUSTERING MODELS
Clustering models are unsupervised techniques, appropriate when there is no
target output field. They analyze a set of inputs and group records with respect to
the identified input data patterns.
The algorithms presented here (K-means, TwoStep, and Kohonen networks)
work best with continuous numeric input fields. Although they can also handle
categorical inputs, the general recommendation is to avoid using categorical
fields for clustering. The TwoStep model uses a specific distance measure that can
handle mixed (both categorical and continuous) inputs more efficiently. K-means
and Kohonen networks apply an internal preprocessing step that encodes
categorical fields. Each categorical field is recoded as a set of flag
(binary/dichotomous) indicator fields, one for each original category. This
recoding is called indicator coding. For each record, the indicator field
corresponding to the category of the record is set to 1 and all other indicator
fields are set to 0. Despite this special handling, categorical clustering fields
tend to dominate the formation of the clusters and usually yield biased clustering
solutions which overlook the differences attributable to the rest of the inputs.
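The indicator coding described above can be sketched in a few lines of Python. This is only an illustration of the recoding idea, not the internal procedure of any particular tool; the function name indicator_code and the sample field segment are hypothetical.

```python
def indicator_code(values):
    """Recode a categorical field as one 0/1 flag field per category."""
    # One indicator field per distinct category, in sorted order.
    categories = sorted(set(values))
    # For each record, the flag of its own category is 1 and all others are 0.
    flags = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, flags

# Hypothetical categorical field with three categories.
segment = ["gold", "silver", "gold", "bronze"]
categories, flags = indicator_code(segment)

print(categories)
for value, row in zip(segment, flags):
    print(value, row)
```

Each record ends up with exactly one indicator set to 1, so two records that belong to different categories differ in exactly two of the indicator fields.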
Input clustering fields are often measured on different scales. Since
clustering models take into account the differences between records, differences in
measurement scales can bias the clustering solution simply because some
fields are measured in larger values. Fields measured in larger values show
larger variability and, if used in their original scale, will dominate the cluster
solution. Thus, a standardization (normalization) process is necessary in order to
put fields on comparable scales and ensure that fields with larger values do not
determine the solution. The two most common standardization methods are
the z-score and the 0-1 (or min-max) approaches. In the z-score approach, the
standardized field is created as follows:
(Record value - mean value of field) / standard deviation of the field.
The resulting fields have a mean of 0 and a standard deviation of 1. Each
standardized value denotes the number of standard deviations by which the record
lies above or below the overall mean value of the field.
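As a minimal sketch of the z-score approach, the snippet below standardizes one field using only the Python standard library. The sample field income is hypothetical, and the use of the population standard deviation (statistics.pstdev) is an assumption, since the text does not say whether the population or the sample formula is meant.

```python
import statistics

def z_scores(values):
    """Standardize a field: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation. A constant field has a
    # standard deviation of 0 and would raise ZeroDivisionError here.
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical continuous field.
income = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
z = z_scores(income)

print([round(v, 3) for v in z])
print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
```

The middle record equals the field mean, so its standardized value is 0; the other values express the distance from the mean in standard deviations.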
The 0-1 approach rescales all record values to the range 0-1, by subtracting
the minimum from each value and dividing the difference by the range of
the field:
(Record value - minimum value of field) / (maximum value of field - minimum value of field).
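The following sketch applies the 0-1 rescaling to two fields and compares the Euclidean distance between two records before and after rescaling, to illustrate why a field measured in larger values can dominate. The fields income and age, their values, and the helper names are hypothetical, and Euclidean distance is assumed here purely for illustration.

```python
import math

def min_max(values):
    """Rescale a field to 0-1: (value - minimum) / (maximum - minimum)."""
    lo, hi = min(values), max(values)
    # A constant field (hi == lo) would divide by zero; it is assumed to
    # have been dropped before clustering.
    return [(v - lo) / (hi - lo) for v in values]

def distance(p, q):
    """Euclidean distance between two records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical records described by two fields on very different scales.
income = [20000, 21000, 80000]
age = [18, 66, 18]

scaled_income = min_max(income)
scaled_age = min_max(age)

# Distance between the first two records, before and after rescaling.
raw_d = distance((income[0], age[0]), (income[1], age[1]))
scaled_d = distance((scaled_income[0], scaled_age[0]),
                    (scaled_income[1], scaled_age[1]))

print(scaled_age)
print(round(raw_d, 2), round(scaled_d, 4))
```

In the original units, the income gap of 1,000 accounts for almost all of the distance between the first two records, even though their ages lie at opposite extremes of the field. After rescaling, the age difference contributes most of the distance.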