IBM SPSS Modeler integrates internal standardization methods for all the algorithms presented here, so the user does not need to compensate for fields measured on different scales. In general, though, standardization is a preprocessing step that should precede all clustering models. As a reminder, principal component scores are already standardized, which is an additional advantage of using PCA as the first step before clustering.
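Modeler performs this scaling internally, but the effect is easy to reproduce outside the tool. The sketch below, in Python with scikit-learn (not part of Modeler), applies the same kind of 0-1 min-max standardization before running K-means; the sample values and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Two inputs on very different scales: income in dollars, age in years.
X = np.array([[52000, 34], [61000, 29], [18000, 63],
              [95000, 41], [23000, 58], [70000, 38]], dtype=float)

# Rescale every field to the 0-1 range so no single field dominates
# the distance calculations -- the compensation Modeler applies internally.
X_scaled = MinMaxScaler().fit_transform(X)

# Cluster the standardized data; k=2 is an arbitrary illustrative choice.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```

Without the scaling step, the income field, with values tens of thousands of times larger than age, would dominate the distance computation entirely.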
The presented algorithms also differ in their handling of missing (null) values. In IBM SPSS Modeler, K-means and Kohonen networks impute null values with "neutral" values. Missing values of numeric and categorical flag fields (dichotomous or binary) are substituted with a value of 0.5 (remember that numeric fields are by default internally standardized to the range 0-1). For categorical set fields with more than two outcomes, the derived indicator fields are set to 0. Consequently, null values affect model training, and new cases with nulls are still scored and assigned to one of the identified clusters. Records with null values are not supported in TwoStep models: they are excluded from TwoStep model training, and new records with nulls are neither scored nor assigned to a cluster.
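To make the "neutral value" rule concrete, here is a minimal Python/pandas sketch of the same idea (an illustration, not Modeler's actual code; the column lists and the "T"/"F" flag coding are assumptions about the input data):

```python
import pandas as pd

def neutral_impute(df, numeric_cols, flag_cols, set_cols):
    """Mimic the neutral-value imputation described for K-means/Kohonen:
    0.5 for nulls in 0-1 scaled numeric and flag fields, all-zero
    indicators for nulls in multi-category set fields."""
    out = pd.DataFrame(index=df.index)
    # Numeric fields: min-max scale to 0-1, then fill nulls with 0.5.
    for col in numeric_cols:
        x = df[col].astype(float)
        out[col] = ((x - x.min()) / (x.max() - x.min())).fillna(0.5)
    # Flag (binary) fields: map to 0/1, fill nulls with the neutral 0.5.
    for col in flag_cols:
        out[col] = df[col].map({"F": 0.0, "T": 1.0}).fillna(0.5)
    # Set fields: one-hot indicators; a null source value leaves every
    # derived indicator at 0 (pandas omits NaN categories by default).
    for col in set_cols:
        out = out.join(pd.get_dummies(df[col], prefix=col, dtype=float))
    return out
```

A record with every field null would therefore land at the exact midpoint of each 0-1 numeric dimension and at the origin of each set-field indicator block, which is why such records can still be trained on and scored, but only in this "neutral" sense.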
Another important issue to consider in clustering is the effect of possible outliers. Outliers are records with extreme values and unusual data patterns. They can be identified by examining the data with simple descriptive statistics. IBM SPSS Modeler offers a data exploration tool (Data Audit) that provides basic statistics (average, minimum, and maximum value) and also identifies outlier values by examining deviations from the overall mean. Outliers can also be spotted by specialized modeling techniques such as IBM SPSS Modeler's Anomaly Detection algorithm, which examines complete records and identifies unusual data patterns. Outlier records deserve special investigation, as in many cases they are exactly what we are looking for: exceptionally good customers or, at the other extreme, fraudulent cases. But they can have a negative impact on clustering models that aim to capture "typical" behaviors. They can confuse the clustering algorithm and lead to poor, distorted results. In many cases, the differences between "outlier" and "normal" data patterns are so large that they mask the finer differences among the majority of "normal" cases. As a result, the algorithm can be guided to a degenerate solution that merely separates outliers from the rest of the records. The clustering model may then produce a poor solution consisting of one large cluster of "normal" behavior and many very small clusters representing the unusual data patterns. Such an analysis may be useful, for instance, in fraud detection, but it is certainly not appropriate for general-purpose segmentation. Although the standardization process smooths values and reduces the influence of outliers on the formation of the clusters, a recommended approach for an enriched general-purpose solution is to identify outliers and treat them separately. IBM SPSS Modeler's TwoStep algorithm includes an optional outlier handling feature that sets aside records which cannot be adequately assigned to any of the regular clusters.
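The deviation-from-the-mean check that Data Audit performs can be sketched in a few lines of Python (an illustration of the idea, not Modeler's implementation; the three-standard-deviation threshold is an assumed, configurable default):

```python
import numpy as np

def flag_outliers(x, sd_threshold=3.0):
    """Flag values lying more than sd_threshold standard deviations
    from the mean -- the deviation-from-the-mean rule described above.
    The threshold is an illustrative assumption."""
    x = np.asarray(x, dtype=float)
    z = (x - np.nanmean(x)) / np.nanstd(x)
    return np.abs(z) > sd_threshold

# Synthetic example: 200 "normal" values plus one injected extreme value.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100, 10, size=200), [400.0]])
print(np.where(flag_outliers(values))[0])  # only the injected 400 is flagged
```

Records flagged this way can then be examined individually, excluded from training, or routed to a dedicated "outlier" segment before building the general-purpose clustering model.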