Data Mining Techniques for Segmentation - Data Mining Techniques in CRM: Inside Customer Segmentation

The TwoStep algorithm does not require the user to set the number of clusters in advance. It can suggest a clustering solution automatically: the "optimal" number of clusters is determined by a criterion that takes the following into account:

• The goodness of fit of the solution (the Bayesian information criterion (BIC), also known as the Schwarz Bayesian criterion), that is, how well the data are fitted by the current number of clusters.

• The distance measure for merging the subclusters. The algorithm examines the final steps in the hierarchical procedure and tries to spot a sudden increase in the merging distances. Such a jump indicates that the agglomerative procedure has started to join dissimilar clusters, so the solution just before the jump is taken as the appropriate number of clusters to fit.
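
The second criterion can be illustrated with a short sketch. This is a simplified illustration of the "sudden increase in merging distances" idea, not the actual TwoStep implementation in IBM SPSS Modeler: the function name suggest_cluster_count, its parameters, and the sample distances are all hypothetical, and the jump is measured here simply as the ratio between consecutive merge distances.

```python
def suggest_cluster_count(merge_distances, min_clusters=2, max_clusters=15):
    """Suggest a cluster count from agglomerative merge distances.

    merge_distances[i] is the distance at which merge i took place, in the
    order the merges were performed. With n starting subclusters there are
    n - 1 merges, and n - i clusters exist just before merge i.
    (Hypothetical helper for illustration only.)
    """
    n = len(merge_distances) + 1
    best_k, best_ratio = None, 0.0
    for i in range(1, len(merge_distances)):
        k = n - i  # clusters present just before merge i
        if not (min_clusters <= k <= max_clusters):
            continue
        previous = merge_distances[i - 1]
        if previous <= 0:
            continue  # avoid dividing by zero for identical subclusters
        ratio = merge_distances[i] / previous
        if ratio > best_ratio:
            # The largest relative jump marks the first "dissimilar" merge,
            # so we keep the solution that existed just before it.
            best_k, best_ratio = k, ratio
    return best_k


# Made-up distances for 8 subclusters (7 merges); the jump is at the 6th merge.
distances = [0.5, 0.6, 0.7, 0.9, 1.0, 4.8, 5.5]
print(suggest_cluster_count(distances))  # prints 3
```

In this made-up example the sixth merge is far more expensive than the fifth, so the sketch stops just before it and proposes the three clusters that existed at that point.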

Another advantage of the TwoStep algorithm is its integrated outlier-handling option, which minimizes the effect of noisy records that could otherwise distort the segmentation solution. Outlier identification takes place in the pre-clustering phase. Pre-clusters with few members compared to the others (fewer than 25% of the size of the largest pre-cluster) are treated as potential outliers. These records are set aside and the pre-clustering procedure is rerun without them. Outliers that still do not fit the revised pre-cluster solution are excluded from the subsequent hierarchical clustering step and do not take part in forming the final clusters. Instead, they are assigned to a "noise" cluster.
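
The size rule described above can be sketched as follows. This is only a minimal illustration of the 25% threshold, under the assumption that pre-cluster sizes are available as a simple mapping; the function name flag_small_preclusters and the sample sizes are hypothetical, and the rerun of the pre-clustering step is not shown.

```python
def flag_small_preclusters(sizes, threshold=0.25):
    """Return the ids of pre-clusters treated as potential outliers.

    sizes maps a pre-cluster id to its number of records. A pre-cluster is
    flagged when it holds fewer records than `threshold` times the size of
    the largest pre-cluster. (Hypothetical helper for illustration only.)
    """
    if not sizes:
        return []
    cutoff = threshold * max(sizes.values())
    return sorted(cid for cid, size in sizes.items() if size < cutoff)


# Made-up pre-cluster sizes; the largest has 400 records, so the cutoff is 100.
sizes = {"A": 400, "B": 380, "C": 90, "D": 12, "E": 100}
print(flag_small_preclusters(sizes))  # prints ['C', 'D']
```

Pre-cluster E sits exactly on the cutoff and is kept, since the rule flags only pre-clusters strictly smaller than 25% of the largest one.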

One drawback of this algorithm is that its results depend, to some extent, on the order of the input data. It is therefore recommended that the records be reordered randomly before model training.
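
Outside IBM SPSS Modeler, the random reordering could be done with a few lines of code before the data are passed to the clustering step. The sketch below assumes the records fit in memory as a list; the function name shuffle_records and the fixed seed are hypothetical choices made so that the reordering is reproducible.

```python
import random


def shuffle_records(records, seed=42):
    """Return a randomly reordered copy of the records.

    A fixed seed makes the reordering reproducible, so the same model can be
    rebuilt later. The input list itself is left unchanged.
    (Hypothetical helper for illustration only.)
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


customers = ["c001", "c002", "c003", "c004", "c005"]
reordered = shuffle_records(customers)
print(len(reordered) == len(customers))  # prints True
```

Training the model on two or three different reorderings and comparing the resulting segments is one simple way to check how sensitive a solution is to record order.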

Recommended TwoStep Options

Figure 3.8 and Table 3.10 outline the recommended approach for applying the TwoStep clustering algorithm in IBM SPSS Modeler.

Table 3.10 IBM SPSS Modeler recommended TwoStep options.

Option | Setting | Functionality/reasoning for selection

Automatically calculate number of clusters | Selected | This option enables automatic clustering and lets the algorithm suggest the number of clusters to fit. The "maximum" and "minimum" text boxes allow analysts to restrict the range of solutions to be evaluated. The default setting limits the algorithm to evaluating the last 15 steps of the hierarchical clustering procedure and to proposing a solution with a minimum of 2 and a maximum of 15 clusters.
