affect the final accuracy and will lead, on the other hand, to a complex
and less comprehensible decision tree (and hence a complex and less
comprehensible composite classifier). Moreover, since the classifiers are
required to generalize from the instances in their sub-spaces, they must
be trained on samples of sufficient size.
Kohavi's stopping rule can be revised into a rule that never considers further splits in nodes that correspond to β|S| instances or fewer, where 0 < β < 1 is a proportion and |S| is the number of instances in the original training set, S. When using this stopping rule (either in Kohavi's way or
in the revised version), a threshold parameter must be provided to DFID as well as to the function StoppingCriterion. Another heuristic stopping rule is never to consider splitting a node if a single classifier can accurately describe the node's sub-space (i.e., if a single classifier, trained on all of the node's instances using the embedded classification method, appears to be accurate). In practice, this rule can be checked by comparing an accuracy estimate of the classifier to a predefined threshold (so using this rule requires an additional parameter). The motivation for this stopping rule is that if a single classifier is good enough, why replace it with a more complex tree that also has weaker generalization capability? Finally, as mentioned above, another (inherent) stopping rule of DFID is the lack of even a single candidate attribute.
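To make these rules concrete, the following sketch shows how a StoppingCriterion function might combine them. The parameter names and default values are illustrative, not DFID's actual interface, and scikit-learn's GaussianNB merely stands in for whatever classification method is embedded in DFID:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def stopping_criterion(X_node, y_node, n_total, candidate_attributes,
                       beta=0.1, accuracy_threshold=0.95):
    """Decide whether a DFID node should become a leaf.

    Names and defaults are illustrative; GaussianNB stands in for
    the classification method embedded in DFID.
    """
    # Inherent rule: no candidate attribute is left to split on.
    if len(candidate_attributes) == 0:
        return True

    # Revised Kohavi rule: stop once the node holds beta*|S| instances
    # or fewer, where |S| is the size of the original training set.
    if len(y_node) <= beta * n_total:
        return True

    # Accuracy rule: stop if a single classifier trained on all of the
    # node's instances already describes its sub-space accurately.
    # (The beta rule above should keep nodes large enough for cv=5.)
    accuracy = cross_val_score(GaussianNB(), X_node, y_node, cv=5).mean()
    return accuracy >= accuracy_threshold
```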
15.2.2 Splitting Rules
The core question of DFID is how to split nodes. The answer lies in the general function split (Figure 14.1). It should be noted that any splitting rule that is used to grow a pure decision tree is also suitable in DFID.
Kohavi (1996) suggested a new splitting rule, which selects the attribute with the highest value of a measure that he refers to as “utility”. Kohavi defines the utility as the fivefold cross-validation accuracy estimate of using a naive Bayes method to classify the sub-spaces that would be generated by the considered split.
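As an illustration, the following sketch estimates this utility for a single nominal attribute. GaussianNB and the small-sample guard are assumptions standing in for the naive Bayes learner Kohavi used; the weighting of each sub-space by its share of the instances follows his definition:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def split_utility(X, y, attribute):
    """Kohavi-style utility of splitting on one nominal attribute:
    fivefold cross-validated accuracy of naive Bayes classifiers built
    on each resulting sub-space, weighted by sub-space size.
    """
    utility = 0.0
    for value in np.unique(X[:, attribute]):
        mask = X[:, attribute] == value
        y_sub = y[mask]
        if len(y_sub) < 5:          # too small to cross-validate
            continue
        accuracy = cross_val_score(GaussianNB(), X[mask], y_sub, cv=5).mean()
        utility += (len(y_sub) / len(y)) * accuracy
    return utility

def select_attribute(X, y, candidate_attributes):
    # Choose the attribute whose split yields the highest utility.
    return max(candidate_attributes, key=lambda a: split_utility(X, y, a))
```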
15.2.3 Split Validation Examinations
Since splitting rules are heuristic, it may be beneficial to regard the splits they produce as recommendations that should be validated. Kohavi (1996) validated a split by estimating the reduction in error gained by the split and comparing it to a predefined threshold of 5% (i.e. if it is estimated that the split reduces the error by less than 5%, the split is rejected and the node is not split).
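A minimal sketch of such a validation, assuming (as in Kohavi's NBTree) that the reduction is measured relative to the node's error, and reusing the hypothetical split_utility function from the previous sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def validate_split(X, y, attribute, threshold=0.05):
    """Accept a recommended split only if the estimated relative error
    reduction exceeds the 5% threshold; split_utility is the sketch
    from Section 15.2.2.
    """
    node_error = 1.0 - cross_val_score(GaussianNB(), X, y, cv=5).mean()
    split_error = 1.0 - split_utility(X, y, attribute)
    if node_error == 0.0:
        return False                 # nothing left to improve
    # Relative reduction in error achieved by performing the split.
    reduction = (node_error - split_error) / node_error
    return reduction > threshold
```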