Information Technology Reference
In-Depth Information
The MEE algorithm produces, on average, smaller trees than the C4.5
algorithm [152, 153] and CART-IG [152], with better generalization and
without significant sacrifice in performance. Denoting by $P_{ed}$ and
$P_{et}$, respectively, the mean training set error rate (resubstitution
estimate) and the mean test set error rate (cross-validation estimate), the
generalization was evaluated by computing $D = |P_{ed} - P_{et}| / s$, where
$s$ is the pooled standard deviation. The Friedman test found a significant
difference ($p \approx 0$) among the methods for the D scores, with the
post-hoc Dunn-Sidak statistical test revealing a significant difference
between MEE and each of C4.5 and CART-TWO, as illustrated in Fig. 6.40.
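As a minimal sketch of how a D score of this kind could be computed for one method on one dataset, the Python fragment below takes per-fold training and test error rates and divides the absolute difference of their means by a pooled standard deviation. The text does not spell out how the pooled standard deviation was obtained; the classical two-sample pooled formula is assumed here, and the function names and error values are hypothetical.

```python
import math
import statistics


def pooled_std(sample_a, sample_b):
    """Pooled standard deviation of two samples.

    Assumption: the classical two-sample form, weighting each unbiased
    sample variance by its degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # unbiased sample variance
    var_b = statistics.variance(sample_b)
    return math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))


def generalization_score(train_errors, test_errors):
    """D = |P_ed - P_et| / s; smaller values suggest better generalization."""
    p_ed = statistics.mean(train_errors)  # mean training set error rate
    p_et = statistics.mean(test_errors)   # mean test set error rate
    return abs(p_ed - p_et) / pooled_std(train_errors, test_errors)


# Hypothetical error rates, for illustration only (not taken from [152]).
train_errors = [0.10, 0.12, 0.11, 0.09]
test_errors = [0.14, 0.16, 0.15, 0.13]
d_score = generalization_score(train_errors, test_errors)
```

The D scores of several methods over a collection of datasets would then be the input to a rank-based comparison such as the Friedman test mentioned above.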
Judging from the tree size ranges [139], the MEE algorithm is significantly
more stable than competing algorithms.
The MEE algorithm is quite insensitive to pruning, at least when cost-
complexity pruning is used. With this pruning method the solutions with
or without pruning were found to be not significantly different, and as a
matter of fact were coincident in many cases [152].
Fig. 6.40 Dunn-Sidak comparison intervals for the D scores.
Over-fitting of tree solutions to training sets is the reason why tree
design algorithms always perform the pruning operation. It can be detected
during the design phase by setting aside a test set and monitoring its error
rate during tree growing; over-fitting is then revealed by an inflected test
error curve, going upward after a minimum. Figure 6.41 shows, for the
consecutive tree levels, the mean training set and test set error rates
(± standard deviation) in 20 experiments of an MEE tree designed for the
Ionosphere dataset [13]. The training set used 85% of randomly chosen cases
and testing was performed on the remaining 15%. There is no evidence of
over-fitting in this case. According to [152], the MEE design revealed mild
over-fitting symptoms (in the last one or two levels) in only 1/8 of the
datasets. In that work a comparison between test set error rates of pruned
and unpruned solutions is also reported in detail; no statistically
significant difference ($p = 0.41$) was found between the two groups of
designed solutions.
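The detection rule described above (a test error curve that turns upward after a minimum) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `tolerance` parameter and the error curves are hypothetical, and a practical check would likely compare the rise against the standard deviation of the test error rather than against a fixed threshold.

```python
def overfitting_onset(test_error_by_level, tolerance=0.0):
    """Return the tree level (0-based index) at which the test error is
    minimal if the curve rises by more than `tolerance` afterwards;
    return None when no such upturn is present.
    """
    if not test_error_by_level:
        return None
    # Index of the first occurrence of the minimum test error.
    min_index = min(range(len(test_error_by_level)),
                    key=test_error_by_level.__getitem__)
    min_error = test_error_by_level[min_index]
    later_errors = test_error_by_level[min_index + 1:]
    if later_errors and max(later_errors) - min_error > tolerance:
        return min_index
    return None


# Hypothetical mean test error rates per tree level, for illustration only.
flat_curve = [0.30, 0.20, 0.15, 0.14, 0.14]      # settles, no upturn
inflected_curve = [0.30, 0.18, 0.12, 0.15, 0.19]  # rises after level 2

no_overfit = overfitting_onset(flat_curve)
overfit_level = overfitting_onset(inflected_curve)
```

With these made-up curves, the first yields no onset and the second flags the level of minimum test error, after which further growth only fits the training set.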
Finally, a comparison of computation times is also presented in [152]. When
the only difference among the algorithms is the implementation of the split-
ting criteria, then the MEE tree algorithm may take substantially less time to