Pruning Trees - Data Mining with Decision Trees: Theory and Applications

Database Reference

In-Depth Information

6.3 Optimal Pruning

The issue of finding the optimal pruning method has been studied by

Bratko and Bohanec (1994) and Almuallim (1996). The Optimal Pruning

Technique (OPT) algorithm guarantees optimality based on dynamic

programming, with the complexity of Θ(

| 2 ), where T is the

initial decision tree. Almuallim (1996) introduced an improvement of

OPT called OPT-2, which also performs optimal pruning using dynamic

programming. However, the time and space complexities of OPT-2 are both

Θ(

|

leaves ( T )

), where T ∗ is the target (pruned) decision tree

and T is the initial decision tree.

Since the pruned tree is habitually much smaller than the initial tree

and the number of internal nodes is smaller than the number of leaves,

OPT-2 is usually more ecient than OPT in terms of computational

complexity.

According to Almuallim, when pruning a given decision tree ( DT )

with s leaves, it is common to progressively replace various sub-trees

of DT by leaves, thus leading to a sequence of pruned DT, DT s− 1 ,

DT s− 2 ,..., DT i ,..., DT 1 , such that each DT i has at most i leaves. When

seeing the error as the difference from the original tree ( DT ), the trees in

the sequence are of increasing error. The goal in choosing the best tree from

the above sequence of pruned trees, according to some appropriate criteria,

is to achieve a good balance between the trees size and accuracy.

In their paper, Bratko and Bohanec (1994) address the following prob-

lem: “Given a completely accurate but complex definition of a concept,

simplify the definition, possibly at the expanse of accuracy, so that the

simplified definition still corresponds to the concept 'suciently' well”. In

this context, “concepts” are represented by decision trees and the method

of simplification is tree pruning. Therefore, the problem can be stated as:

“Given a decision tree that accurately specifies a concept, the problem is to

find the smallest pruned tree that still represents the concept within some

specified accuracy.”

In a new algorithm [Almuallim (1996)], OPT-2, which also performs

optimal pruning, is introduced. The problem of optimal pruning is formally

presented in this paper as follows: “Given a DT and a positive integer

C , find a pruned DT from DT such that s ( DT )

leaves ( T ∗ )

|

|·|

internal ( T )

|

C and error( DT )is

minimized.” where s ( DT )isthesizeof DT (number of leaves in DT )and

error( DT ) is the proportion of cases that are incorrectly handled by DT .

The OPT-2 algorithm is based on dynamic programming. In its most

basic form, the time and space complexities of OPT-2 are both Θ( nC ),

≤

Data Mining with Decision Trees: Theory and Applications

Search WWH ::

Custom Search

Home