A common pitfall is to overfit the training dataset: overfitted models fail to provide generalizable predictions and will most probably collapse when used on unseen data.
Analysts should develop stable models that capture general data patterns, and one way to avoid unpleasant surprises is to ensure that the support of the rules is acceptably high. Therefore, analysts should require a relatively high minimum number of records for the model's parent and child nodes. Although the appropriate settings also depend on the total number of records available, it is generally recommended to keep the number of records in the terminal nodes above 100 and, if possible, between 200 and 300. This ensures reasonably large nodes and reduces the risk of modeling patterns that apply only to the specific records analyzed.
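As a minimal sketch of how such size constraints can be imposed, the snippet below uses scikit-learn's decision tree; the library, the parameter names, and the exact values are illustrative assumptions, since the text does not name a specific tool.

from sklearn.tree import DecisionTreeClassifier

# Illustrative size constraints, in line with the 200-300 records
# suggested above for terminal nodes; adapt the exact values to the
# total number of records available.
tree = DecisionTreeClassifier(
    min_samples_split=400,  # minimum records in a parent node before splitting
    min_samples_leaf=200,   # minimum records in each terminal (leaf) node
)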
Moreover, certain decision tree algorithms incorporate an integrated pruning procedure that trims the tree after full growth. The main concept behind pruning is to collapse specific branches in order to end up with a smaller, more stable subtree: one with a simpler structure and equivalent performance, but better validation properties. Tree pruning is a useful feature and should be selected wherever it is available.
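Where the modeling tool exposes pruning directly, it can simply be switched on. As a hedged sketch of the same idea in code, scikit-learn offers cost-complexity pruning: the tree is grown fully and then cut back with the pruning strength that performs best on a held-out partition. The data and parameters here are synthetic assumptions for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the analyst's training dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Enumerate candidate pruning strengths for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the pruned subtree that validates best: smaller structure,
# equivalent performance, better generalization.
best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score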
Finally, as with any predictive model, the decision tree results should be validated and evaluated on a disjoint (holdout) dataset before their actual use and deployment.
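A minimal sketch of such a holdout evaluation, again assuming scikit-learn and synthetic data:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
# Keep a disjoint test partition that plays no part in model building.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(min_samples_leaf=200, random_state=1).fit(X_train, y_train)
# Per-class precision/recall on unseen records, before deployment.
print(classification_report(y_test, model.predict(X_test)))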
Due to their transparency and their ability to examine the relationship between many inputs and the outcome, decision trees are commonly applied in clustering to profile the revealed clusters. Moreover, they can be used as scoring models for assigning new cases to the revealed clusters through the derived rule set.
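The sketch below illustrates this two-step use under assumed tools: clusters are revealed first (k-means here, as an assumption), a tree is then fitted on the cluster membership field to derive profiling rules, and the same tree scores new cases.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-dimensional inputs stand in for the segmentation data.
X, _ = make_blobs(n_samples=1500, centers=3, random_state=0)

# Step 1: reveal the clusters (any clustering algorithm would do).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: profile the clusters by modeling the membership field.
profiler = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=0)
profiler.fit(X, clusters)
print(export_text(profiler))  # human-readable rule set per cluster

# Step 3: use the derived rule set as a scoring model for new cases.
new_cases = [[0.5, -1.2], [3.0, 4.1]]
print(profiler.predict(new_cases))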
Using Decision Trees for Cluster Profiling and Updating
A prerequisite for using a decision tree model for cluster profiling and updating is that the model achieves a satisfactory level of accuracy in predicting all revealed clusters. Underperforming models with high overall error (misclassification) rates would complicate rather than help the procedure.
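One way to check this prerequisite, sketched with assumed scikit-learn utilities, is a confusion matrix over a holdout partition, from which per-cluster accuracy (recall) can be read off:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=1500, centers=4, random_state=2)
clusters = KMeans(n_clusters=4, n_init=10, random_state=2).fit_predict(X)

X_tr, X_te, c_tr, c_te = train_test_split(X, clusters, test_size=0.3, random_state=2)
tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=2).fit(X_tr, c_tr)

cm = confusion_matrix(c_te, tree.predict(X_te))
# Share of each cluster's holdout records that the tree predicts correctly;
# all clusters, not just the overall rate, should be acceptably high.
print(cm.diagonal() / cm.sum(axis=1))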
Decision tree models can also help with the identification of records that do not fit well in the revealed clusters. These models separate records into homogeneous subsets (terminal nodes) with respect to the cluster membership field (the target field). An effective decision tree model will successfully partition the dataset into pure subsets, each dominated by a single cluster. The records that the model fails to separate land on terminal nodes of mixed cluster composition; these are the records that do not fit well in the revealed clusters.
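A sketch of flagging such records, under the same illustrative setup: for a decision tree, predict_proba returns the cluster proportions of each record's terminal node, so a low maximum proportion indicates an impure leaf. The 0.8 threshold is an arbitrary assumption, not a value from the text.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Overlapping blobs so that some records genuinely fit no cluster well.
X, _ = make_blobs(n_samples=1500, centers=3, cluster_std=2.5, random_state=3)
clusters = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=3).fit(X, clusters)

# For a tree, predict_proba returns each record's leaf composition:
# the fraction of training records per cluster in that terminal node.
purity = tree.predict_proba(X).max(axis=1)
poor_fit = np.where(purity < 0.8)[0]  # records in mixed (impure) leaves
print(f"{len(poor_fit)} records fall in terminal nodes with purity below 0.8")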