A common pitfall is to overfit the training dataset: overfitted models fail to provide generalizable predictions and will most probably collapse when used on unseen data.
Analysts should develop stable models that capture general data patterns, and one way to avoid unpleasant surprises is to ensure that the support of the rules is acceptably high. Therefore, analysts should require a relatively high minimum number of records for the model's parent and child nodes. Although the appropriate settings also depend on the total number of records available, it is generally recommended to keep the number of records in the terminal nodes above 100 and, if possible, between 200 and 300. This ensures reasonably large nodes and reduces the risk of modeling patterns that apply only to the specific records analyzed.
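As a minimal sketch of how such size constraints can be imposed, the snippet below uses scikit-learn's decision tree; the library, the parameter names, and the exact values are illustrative assumptions, since the text does not name a specific tool.

from sklearn.tree import DecisionTreeClassifier

# Illustrative size constraints, in line with the 200-300 records
# suggested above for terminal nodes; adapt the exact values to the
# total number of records available.
tree = DecisionTreeClassifier(
    min_samples_split=400,  # minimum records in a parent node before splitting
    min_samples_leaf=200,   # minimum records in each terminal (leaf) node
)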
Moreover, certain decision tree algorithms incorporate an integrated pruning procedure that trims the tree after full growth. The main concept behind pruning is to collapse specific branches in order to end up with a smaller, more stable subtree: one with a simpler structure and equivalent performance, but better validation properties. Tree pruning is a useful feature and should be selected wherever it is available.
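Where the modeling tool exposes pruning directly, it can simply be switched on. As a hedged sketch of the same idea in code, scikit-learn offers cost-complexity pruning: the tree is grown fully and then cut back with the pruning strength that performs best on a held-out partition. The data and parameters here are synthetic assumptions for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the analyst's training dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Enumerate candidate pruning strengths for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the pruned subtree that validates best: smaller structure,
# equivalent performance, better generalization.
best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score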
Finally, as with any predictive model, the decision tree results should be validated and evaluated on a disjoint (holdout) dataset before their actual use and deployment.
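A minimal sketch of such a holdout evaluation, again assuming scikit-learn and synthetic data:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
# Keep a disjoint test partition that plays no part in model building.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(min_samples_leaf=200, random_state=1).fit(X_train, y_train)
# Per-class precision/recall on unseen records, before deployment.
print(classification_report(y_test, model.predict(X_test)))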
Due to their transparency and their ability to examine the relationship between many inputs and the outcome, decision trees are commonly applied in clustering to profile the revealed clusters. Moreover, they can be used as scoring models for assigning new cases to the revealed clusters through the derived rule set.
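The sketch below illustrates this two-step use under assumed tools: clusters are revealed first (k-means here, as an assumption), a tree is then fitted on the cluster membership field to derive profiling rules, and the same tree scores new cases.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-dimensional inputs stand in for the segmentation data.
X, _ = make_blobs(n_samples=1500, centers=3, random_state=0)

# Step 1: reveal the clusters (any clustering algorithm would do).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: profile the clusters by modeling the membership field.
profiler = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=0)
profiler.fit(X, clusters)
print(export_text(profiler))  # human-readable rule set per cluster

# Step 3: use the derived rule set as a scoring model for new cases.
new_cases = [[0.5, -1.2], [3.0, 4.1]]
print(profiler.predict(new_cases))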
Using Decision Trees for Cluster Profiling and Updating
A prerequisite for using a decision tree model for cluster profiling and updating is that the model achieves a satisfactory level of accuracy in predicting all revealed clusters. Underperforming models with high overall error (misclassification) rates would complicate rather than help the procedure.
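One way to check this prerequisite, sketched with assumed scikit-learn utilities, is a confusion matrix over a holdout partition, from which per-cluster accuracy (recall) can be read off:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=1500, centers=4, random_state=2)
clusters = KMeans(n_clusters=4, n_init=10, random_state=2).fit_predict(X)

X_tr, X_te, c_tr, c_te = train_test_split(X, clusters, test_size=0.3, random_state=2)
tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=2).fit(X_tr, c_tr)

cm = confusion_matrix(c_te, tree.predict(X_te))
# Share of each cluster's holdout records that the tree predicts correctly;
# all clusters, not just the overall rate, should be acceptably high.
print(cm.diagonal() / cm.sum(axis=1))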
Decision tree models can also help with the identification of records that do not fit well in the revealed clusters. These models separate records into homogeneous subsets (terminal nodes) with respect to the cluster membership field (the target field). An effective decision tree model will successfully partition the dataset into pure subsets, each dominated by a single cluster. The records that the model fails to separate land on terminal nodes of mixed cluster composition; these are the records that do not fit well in the revealed clusters.
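A sketch of flagging such records, under the same illustrative setup: for a decision tree, predict_proba returns the cluster proportions of each record's terminal node, so a low maximum proportion indicates an impure leaf. The 0.8 threshold is an arbitrary assumption, not a value from the text.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Overlapping blobs so that some records genuinely fit no cluster well.
X, _ = make_blobs(n_samples=1500, centers=3, cluster_std=2.5, random_state=3)
clusters = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=3).fit(X, clusters)

# For a tree, predict_proba returns each record's leaf composition:
# the fraction of training records per cluster in that terminal node.
purity = tree.predict_proba(X).max(axis=1)
poor_fit = np.where(purity < 0.8)[0]  # records in mixed (impure) leaves
print(f"{len(poor_fit)} records fall in terminal nodes with purity below 0.8")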