However, this correction still produces an optimistic error rate. Consequently, Quinlan (1993) suggests pruning an internal node $t$ if its error rate is within one standard error from a reference tree, namely:

$$\varepsilon(pruned(T,t),S) \le \varepsilon(T,S) + \sqrt{\frac{\varepsilon(T,S)\cdot(1-\varepsilon(T,S))}{|S|}}. \qquad (6.4)$$
The last condition is based on the statistical confidence interval for proportions. Usually, the last condition is used such that $T$ refers to a subtree whose root is the internal node $t$ and $S$ denotes the portion of the training set that refers to the node $t$.
The pessimistic pruning procedure performs a top-down traversal over the internal nodes. If an internal node is pruned, then all its descendants are removed from the pruning process, resulting in a relatively fast pruning procedure.
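The one-standard-error test of Eq. (6.4) can be sketched as follows (a minimal illustration; the function name and arguments are hypothetical, not from the source):

```python
import math

def should_prune(err_pruned, err_tree, n):
    """Pessimistic pruning test of Eq. (6.4): prune the node if the error
    rate of the pruned tree is within one standard error of the reference
    tree's error rate.

    err_pruned: error rate of pruned(T, t) on S
    err_tree:   error rate of the reference tree T on S
    n:          |S|, the number of training instances
    """
    std_err = math.sqrt(err_tree * (1 - err_tree) / n)
    return err_pruned <= err_tree + std_err
```

For example, with a reference error of 0.10 on 100 instances the standard error is 0.03, so a pruned-tree error of 0.12 still passes the test while 0.20 does not.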
6.2.6 Error-Based Pruning (EBP)

EBP is an evolution of the pessimistic pruning. It is implemented in the well-known C4.5 algorithm.

As in pessimistic pruning, the error rate is estimated using the upper bound of the statistical confidence interval for proportions:

$$\varepsilon_{UB}(T,S) = \varepsilon(T,S) + Z_{\alpha}\cdot\sqrt{\frac{\varepsilon(T,S)\cdot(1-\varepsilon(T,S))}{|S|}}, \qquad (6.5)$$
where $\varepsilon(T,S)$ denotes the misclassification rate of the tree $T$ on the training set $S$; $Z$ is the inverse of the standard normal cumulative distribution; and $\alpha$ is the desired significance level.
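A sketch of Eq. (6.5) using the standard library's inverse normal CDF; it assumes $Z_{\alpha}$ is the $(1-\alpha)$ quantile of the standard normal, and the default $\alpha = 0.25$ mirrors C4.5's well-known default confidence factor (the function name is illustrative):

```python
import math
from statistics import NormalDist

def error_upper_bound(err, n, alpha=0.25):
    """Upper bound eps_UB(T, S) of Eq. (6.5).

    err:   observed misclassification rate eps(T, S)
    n:     |S|, number of training instances reaching the tree
    alpha: significance level; Z_alpha is taken here as the (1 - alpha)
           quantile of the standard normal distribution (an assumption).
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return err + z * math.sqrt(err * (1 - err) / n)
```

Note that the bound tightens toward the observed error rate as $|S|$ grows, so small nodes are penalized more heavily than large ones.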
Let $subtree(T,t)$ denote the subtree rooted by the node $t$. Let $maxchild(T,t)$ denote the most frequent child node of $t$ (namely, most of the instances in $S$ reach this particular child) and let $S_t$ denote all instances in $S$ that reach the node $t$. The procedure traverses bottom-up all nodes and compares the following values:
(1) $\varepsilon_{UB}(subtree(T,t), S_t)$
(2) $\varepsilon_{UB}(pruned(subtree(T,t), t), S_t)$
(3) $\varepsilon_{UB}(subtree(T, maxchild(T,t)), S_{maxchild(T,t)})$.
According to the lowest value, the procedure either leaves the tree as is, prunes the node $t$, or replaces the node $t$ with the subtree rooted by $maxchild(T,t)$.