However, this correction still produces an optimistic error rate. Consequently, Quinlan (1993) suggests pruning an internal node $t$ if its error rate is within one standard error from a reference tree, namely:

$$\varepsilon(pruned(T,t),S) \le \varepsilon(T,S) + \sqrt{\frac{\varepsilon(T,S)\cdot(1-\varepsilon(T,S))}{|S|}}. \qquad (6.4)$$
The last condition is based on the statistical confidence interval for proportions. Usually, the last condition is used such that $T$ refers to a subtree whose root is the internal node $t$ and $S$ denotes the portion of the training set that refers to the node $t$.
The pessimistic pruning procedure performs a top-down traversal over the internal nodes. If an internal node is pruned, then all its descendants are removed from the pruning process, resulting in a relatively fast pruning procedure.
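The one-standard-error test of Eq. (6.4) can be sketched as follows (a minimal illustration; the function name and arguments are hypothetical, not from the source):

```python
import math

def should_prune(err_pruned, err_tree, n):
    """Pessimistic pruning test of Eq. (6.4): prune the node if the error
    rate of the pruned tree is within one standard error of the reference
    tree's error rate.

    err_pruned: error rate of pruned(T, t) on S
    err_tree:   error rate of the reference tree T on S
    n:          |S|, the number of training instances
    """
    std_err = math.sqrt(err_tree * (1 - err_tree) / n)
    return err_pruned <= err_tree + std_err
```

For example, with a reference error of 0.10 on 100 instances the standard error is 0.03, so a pruned-tree error of 0.12 still passes the test while 0.20 does not.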
6.2.6 Error-Based Pruning (EBP)

EBP is an evolution of the pessimistic pruning. It is implemented in the well-known C4.5 algorithm.

As in pessimistic pruning, the error rate is estimated using the upper bound of the statistical confidence interval for proportions:

$$\varepsilon_{UB}(T,S) = \varepsilon(T,S) + Z_{\alpha}\cdot\sqrt{\frac{\varepsilon(T,S)\cdot(1-\varepsilon(T,S))}{|S|}}, \qquad (6.5)$$
where $\varepsilon(T,S)$ denotes the misclassification rate of the tree $T$ on the training set $S$; $Z$ is the inverse of the standard normal cumulative distribution; and $\alpha$ is the desired significance level.
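A sketch of Eq. (6.5) using the standard library's inverse normal CDF; it assumes $Z_{\alpha}$ is the $(1-\alpha)$ quantile of the standard normal, and the default $\alpha = 0.25$ mirrors C4.5's well-known default confidence factor (the function name is illustrative):

```python
import math
from statistics import NormalDist

def error_upper_bound(err, n, alpha=0.25):
    """Upper bound eps_UB(T, S) of Eq. (6.5).

    err:   observed misclassification rate eps(T, S)
    n:     |S|, number of training instances reaching the tree
    alpha: significance level; Z_alpha is taken here as the (1 - alpha)
           quantile of the standard normal distribution (an assumption).
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return err + z * math.sqrt(err * (1 - err) / n)
```

Note that the bound tightens toward the observed error rate as $|S|$ grows, so small nodes are penalized more heavily than large ones.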
Let $subtree(T,t)$ denote the subtree rooted by the node $t$. Let $maxchild(T,t)$ denote the most frequent child node of $t$ (namely, most of the instances in $S$ reach this particular child) and let $S_t$ denote all instances in $S$ that reach the node $t$. The procedure traverses bottom-up all nodes and compares the following values:
(1) $\varepsilon_{UB}(subtree(T,t), S_t)$
(2) $\varepsilon_{UB}(pruned(subtree(T,t), t), S_t)$
(3) $\varepsilon_{UB}(subtree(T, maxchild(T,t)), S_{maxchild(T,t)})$.
According to the lowest value, the procedure either leaves the tree as is, prunes the node $t$, or replaces the node $t$ with the subtree rooted by $maxchild(T,t)$.