affect the final accuracy and will lead, on the other hand, to a complex
and less comprehensible decision tree (and hence a complex and less
comprehensible composite classifier). Moreover, since the classifiers are
required to generalize from the instances in their sub-spaces, they must
be trained on samples of sufficient size.
Kohavi's stopping rule can be revised into a rule that never considers further splits in nodes that correspond to β|S| instances or fewer, where 0 < β < 1 is a proportion and |S| is the number of instances in the original training set, S. When using this stopping rule (either in Kohavi's way or
in the revised version), a threshold parameter must be provided to DFID as well as to the function StoppingCriterion. Another heuristic stopping rule is never to consider splitting a node if a single classifier can accurately describe the node's sub-space (i.e., if a single classifier, trained on all of the node's instances using the embedded classification method, appears to be accurate). In practice, this rule can be checked by comparing an accuracy estimate of the classifier to a predefined threshold (so using this rule requires an additional parameter). The motivation for this stopping rule is that if a single classifier is good enough, why replace it with a more complex tree that also has weaker generalization capability? Finally, as mentioned above, another (inherent) stopping rule of DFID is the lack of even a single candidate attribute.
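To make these rules concrete, the following sketch shows how a StoppingCriterion function might combine them. The parameter names and default values are illustrative, not DFID's actual interface, and scikit-learn's GaussianNB merely stands in for whatever classification method is embedded in DFID:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def stopping_criterion(X_node, y_node, n_total, candidate_attributes,
                       beta=0.1, accuracy_threshold=0.95):
    """Decide whether a DFID node should become a leaf.

    Names and defaults are illustrative; GaussianNB stands in for
    the classification method embedded in DFID.
    """
    # Inherent rule: no candidate attribute is left to split on.
    if len(candidate_attributes) == 0:
        return True

    # Revised Kohavi rule: stop once the node holds beta*|S| instances
    # or fewer, where |S| is the size of the original training set.
    if len(y_node) <= beta * n_total:
        return True

    # Accuracy rule: stop if a single classifier trained on all of the
    # node's instances already describes its sub-space accurately.
    # (The beta rule above should keep nodes large enough for cv=5.)
    accuracy = cross_val_score(GaussianNB(), X_node, y_node, cv=5).mean()
    return accuracy >= accuracy_threshold
```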
15.2.2 Splitting Rules
The core question of DFID is how to split nodes. The answer lies in the general function split (Figure 14.1). It should be noted that any splitting rule that is used to grow a pure decision tree is also suitable in DFID.
Kohavi (1996) suggested a new splitting rule, which selects the attribute with the highest value of a measure that he refers to as “utility”. Kohavi defines the utility as the fivefold cross-validation accuracy estimate of using a naive Bayes method to classify the sub-spaces that would be generated by the considered split.
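As an illustration, the following sketch estimates this utility for a single nominal attribute. GaussianNB and the small-sample guard are assumptions standing in for the naive Bayes learner Kohavi used; the weighting of each sub-space by its share of the instances follows his definition:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def split_utility(X, y, attribute):
    """Kohavi-style utility of splitting on one nominal attribute:
    fivefold cross-validated accuracy of naive Bayes classifiers built
    on each resulting sub-space, weighted by sub-space size.
    """
    utility = 0.0
    for value in np.unique(X[:, attribute]):
        mask = X[:, attribute] == value
        y_sub = y[mask]
        if len(y_sub) < 5:          # too small to cross-validate
            continue
        accuracy = cross_val_score(GaussianNB(), X[mask], y_sub, cv=5).mean()
        utility += (len(y_sub) / len(y)) * accuracy
    return utility

def select_attribute(X, y, candidate_attributes):
    # Choose the attribute whose split yields the highest utility.
    return max(candidate_attributes, key=lambda a: split_utility(X, y, a))
```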
15.2.3 Split Validation Examinations
Since splitting rules are heuristic, it may be beneficial to regard the splits they produce as recommendations that should be validated. Kohavi (1996) validated a split by estimating the reduction in error gained by the split and comparing it to a predefined threshold of 5% (i.e. if it is estimated that the split reduces the error by less than 5%, the split is rejected and the node is not split).
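A minimal sketch of such a validation, assuming (as in Kohavi's NBTree) that the reduction is measured relative to the node's error, and reusing the hypothetical split_utility function from the previous sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def validate_split(X, y, attribute, threshold=0.05):
    """Accept a recommended split only if the estimated relative error
    reduction exceeds the 5% threshold; split_utility is the sketch
    from Section 15.2.2.
    """
    node_error = 1.0 - cross_val_score(GaussianNB(), X, y, cv=5).mean()
    split_error = 1.0 - split_utility(X, y, attribute)
    if node_error == 0.0:
        return False                 # nothing left to improve
    # Relative reduction in error achieved by performing the split.
    reduction = (node_error - split_error) / node_error
    return reduction > threshold
```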