The TwoStep algorithm does not require the user to set the number of clusters
to fit in advance. It can suggest a clustering solution automatically: the
algorithm determines the "optimal" number of clusters
according to a criterion that takes the following into account:
• The goodness of fit of the solution (the Bayesian information criterion (BIC), also
known as the Schwarz Bayesian criterion), that is, how well the specific data are
fitted by the current number of clusters.
• The distance measure for merging the subclusters. The algorithm examines the
final steps in the hierarchical procedure and tries to spot a sudden increase in
the merging distances. This point indicates that the agglomerative procedure
has started to join dissimilar clusters and hence indicates the correct number of
clusters to fit.
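The second criterion can be illustrated with a minimal sketch. It assumes the merging distances of the final agglomerative steps are already available, and it picks the number of clusters just before the largest relative jump in merging distance. The function name, the dictionary layout, and the ratio rule are illustrative assumptions; this is not the exact procedure implemented in IBM SPSS Modeler, which also takes the BIC into account.

```python
def suggest_num_clusters(merge_distance, min_clusters=2, max_clusters=15):
    """Suggest a number of clusters from the final merging distances.

    merge_distance[k] is the distance of the merge that reduces a
    k-cluster solution to k - 1 clusters (hypothetical input layout).
    A large value of merge_distance[k] relative to merge_distance[k + 1]
    means that going below k clusters joins dissimilar clusters.
    """
    best_k, best_ratio = None, 0.0
    for k in range(min_clusters, max_clusters + 1):
        # Both the current and the previous merge distance are needed.
        if k not in merge_distance or (k + 1) not in merge_distance:
            continue
        previous = merge_distance[k + 1]
        if previous <= 0:
            continue
        ratio = merge_distance[k] / previous
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


# Illustrative distances: merging from 3 to 2 clusters is far more costly
# than the merges before it, so a 3-cluster solution is suggested.
distances = {6: 1.0, 5: 1.1, 4: 1.3, 3: 4.0, 2: 4.5}
suggested = suggest_num_clusters(distances)
```

With these illustrative distances the largest jump occurs at the merge from three to two clusters, so the sketch returns 3.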
Another advantage of the TwoStep algorithm is that it integrates an outlier
handling option that minimizes the effects of noisy records which otherwise
could distort the segmentation solution. Outlier identification takes place in the
pre-clustering phase. Small pre-clusters with few members compared to other
pre-clusters (less than 25% of the largest pre-cluster) are considered as potential
outliers. These outlier records are set aside and the pre-clustering procedure is
rerun without them. Outliers that still cannot fit the revised pre-cluster solution
are filtered out from the next step of hierarchical clustering and do not participate
in the formation of the final clusters. Instead, they are assigned to a "noise" cluster.
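The size rule for spotting potential outliers can be sketched as follows. The sketch only covers the first screening step (flagging pre-clusters smaller than 25% of the largest pre-cluster); the rerun of the pre-clustering and the final assignment to the noise cluster are not shown. The function name and the input dictionary are hypothetical.

```python
def flag_potential_outliers(precluster_sizes, threshold=0.25):
    """Return the names of pre-clusters that are small relative to the largest.

    precluster_sizes maps a pre-cluster name to its number of records
    (hypothetical input layout). A pre-cluster is flagged when its size is
    below threshold * size of the largest pre-cluster.
    """
    if not precluster_sizes:
        return set()
    largest = max(precluster_sizes.values())
    return {
        name
        for name, size in precluster_sizes.items()
        if size < threshold * largest
    }


# Illustrative sizes: the cut-off is 0.25 * 400 = 100 records.
sizes = {"A": 400, "B": 350, "C": 120, "D": 60, "E": 8}
potential_outliers = flag_potential_outliers(sizes)
```

Here pre-clusters D and E fall below the cut-off of 100 records and are flagged, while C (120 records) is kept.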
One drawback of this algorithm is that, to some extent, its results depend on the
order of the input data. It is therefore recommended to reorder the records
randomly before model training.
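Outside IBM SPSS Modeler, the random reordering can be sketched in a few lines of Python. The helper below is a hypothetical illustration that returns a shuffled copy of the records and accepts an optional seed so that the reordering can be reproduced.

```python
import random


def shuffled_copy(records, seed=None):
    """Return the records in random order without modifying the input."""
    rng = random.Random(seed)  # a dedicated generator keeps the seed local
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled


# Illustrative customer IDs; a fixed seed makes the order repeatable.
customer_ids = list(range(1, 11))
training_order = shuffled_copy(customer_ids, seed=42)
```

Fixing the seed is a design choice: it keeps the reordering random with respect to the original file order while still allowing the same training run to be repeated.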
Recommended TwoStep Options
Figure 3.8 and Table 3.10 outline the recommended approach for applying the
TwoStep clustering algorithm in IBM SPSS Modeler.
Table 3.10 IBM SPSS Modeler recommended TwoStep options.
Option: Automatically calculate number of clusters
Setting: Selected
Functionality/reasoning for selection: This option enables automatic clustering and lets the
algorithm suggest the number of clusters to fit.
The "maximum" and "minimum" text boxes allow
analysts to restrict the range of solutions to be
evaluated. By default, the algorithm evaluates
the last 15 steps of the hierarchical clustering
procedure and proposes a solution comprising a
minimum of 2 and a maximum of 15 clusters.