IBM SPSS Modeler integrates internal standardization methods for all the algorithms presented here, so the user does not need to compensate for fields measured on different scales. In general, though, standardization is a preprocessing step that should precede all clustering models. As a reminder, principal component scores are already standardized, which is an additional advantage of using PCA as the first step before clustering.
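Modeler performs this scaling internally, but the effect is easy to reproduce outside the tool. The sketch below, in Python with scikit-learn (not part of Modeler), applies the same kind of 0-1 min-max standardization before running K-means; the sample values and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Two inputs on very different scales: income in dollars, age in years.
X = np.array([[52000, 34], [61000, 29], [18000, 63],
              [95000, 41], [23000, 58], [70000, 38]], dtype=float)

# Rescale every field to the 0-1 range so no single field dominates
# the distance calculations -- the compensation Modeler applies internally.
X_scaled = MinMaxScaler().fit_transform(X)

# Cluster the standardized data; k=2 is an arbitrary illustrative choice.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```

Without the scaling step, the income field, with values tens of thousands of times larger than age, would dominate the distance computation entirely.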
The presented algorithms also differ in their handling of missing (null) values. In IBM SPSS Modeler, K-means and Kohonen networks impute null values with "neutral" values. Missing values of numeric and categorical flag fields (dichotomous or binary) are substituted with a value of 0.5 (remember that numeric fields are by default internally standardized to the range 0-1). For categorical set fields with more than two outcomes, the derived indicator fields are set to 0. Consequently, null values affect model training, and new cases with nulls are still scored and assigned to one of the identified clusters. Records with null values are not supported in TwoStep models: they are excluded from TwoStep model training, and new records with nulls are neither scored nor assigned to a cluster.
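To make the "neutral value" rule concrete, here is a minimal Python/pandas sketch of the same idea (an illustration, not Modeler's actual code; the column lists and the "T"/"F" flag coding are assumptions about the input data):

```python
import pandas as pd

def neutral_impute(df, numeric_cols, flag_cols, set_cols):
    """Mimic the neutral-value imputation described for K-means/Kohonen:
    0.5 for nulls in 0-1 scaled numeric and flag fields, all-zero
    indicators for nulls in multi-category set fields."""
    out = pd.DataFrame(index=df.index)
    # Numeric fields: min-max scale to 0-1, then fill nulls with 0.5.
    for col in numeric_cols:
        x = df[col].astype(float)
        out[col] = ((x - x.min()) / (x.max() - x.min())).fillna(0.5)
    # Flag (binary) fields: map to 0/1, fill nulls with the neutral 0.5.
    for col in flag_cols:
        out[col] = df[col].map({"F": 0.0, "T": 1.0}).fillna(0.5)
    # Set fields: one-hot indicators; a null source value leaves every
    # derived indicator at 0 (pandas omits NaN categories by default).
    for col in set_cols:
        out = out.join(pd.get_dummies(df[col], prefix=col, dtype=float))
    return out
```

A record with every field null would therefore land at the exact midpoint of each 0-1 numeric dimension and at the origin of each set-field indicator block, which is why such records can still be trained on and scored, but only in this "neutral" sense.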
Another important issue to consider in clustering is the effect of possible outliers. Outliers are records with extreme values and unusual data patterns. They can be identified by examining the data with simple descriptive statistics. IBM SPSS Modeler offers a data exploration tool (Data Audit) that provides basic statistics (average, minimum, and maximum value) and also identifies outlier values by examining deviations from the overall mean. Outliers can also be spotted by specialized modeling techniques such as IBM SPSS Modeler's Anomaly Detection algorithm, which examines complete records and identifies unusual data patterns. Outlier records deserve special investigation, as in many cases they are exactly what we are looking for: exceptionally good customers or, at the other extreme, fraudulent cases. But they can have a negative impact on clustering models that aim to capture "typical" behaviors. They can confuse the clustering algorithm and lead to poor, distorted results. In many cases, the differences between "outlier" and "normal" data patterns are so large that they mask the finer differences among the majority of "normal" cases. As a result, the algorithm can be guided to a degenerate solution that merely separates outliers from the rest of the records. The clustering model may then produce a poor solution consisting of one large cluster of "normal" behavior and many very small clusters representing the unusual data patterns. Such an analysis may be useful, for instance, in fraud detection, but it is certainly not appropriate for general-purpose segmentation. Although the standardization process smooths values and reduces the influence of outliers on the formation of the clusters, a recommended approach for an enriched general-purpose solution is to identify outliers and treat them separately. IBM SPSS Modeler's TwoStep algorithm includes an optional outlier handling feature that sets aside records which cannot be adequately assigned to any of the regular clusters.
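The deviation-from-the-mean check that Data Audit performs can be sketched in a few lines of Python (an illustration of the idea, not Modeler's implementation; the three-standard-deviation threshold is an assumed, configurable default):

```python
import numpy as np

def flag_outliers(x, sd_threshold=3.0):
    """Flag values lying more than sd_threshold standard deviations
    from the mean -- the deviation-from-the-mean rule described above.
    The threshold is an illustrative assumption."""
    x = np.asarray(x, dtype=float)
    z = (x - np.nanmean(x)) / np.nanstd(x)
    return np.abs(z) > sd_threshold

# Synthetic example: 200 "normal" values plus one injected extreme value.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100, 10, size=200), [400.0]])
print(np.where(flag_outliers(values))[0])  # only the injected 400 is flagged
```

Records flagged this way can then be examined individually, excluded from training, or routed to a dedicated "outlier" segment before building the general-purpose clustering model.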