Pruning Trees - Data Mining with Decision Trees: Theory and Applications

Database Reference

In-Depth Information

where n is the number of internal nodes in the initial decision tree and

C is the number of leaves in the target (pruned) decision tree. This

is an improvement of the OPT algorithm presented by Bohanec and

Bratko (1994). One important characteristic of the OPT-2 algorithm is its

significant flexibility. The OPT-2 works sequentially, generating the trees

of the sequence one after the other in increasing order of the number

of leaves (that is in the order DT 1 , DT 2 , DT 3 ,.... ). This differs from

OPT which simultaneously generates the whole sequence of pruned trees.

Sequence generation in OPT-2 can be terminated once a tree with adequate

predetermined criteria is found.

Given that trees DT 1 , DT 2 ,..., DT i− 1 have already been generated,

OPT-2 finds DT i in time Θ( n ), where n is the number of the internal nodes

of the initial decision tree. Thus, if the number of leaves of the target tree is

C , the total running time of OPT-2 will be Θ( nC ). Since the goal is to prune

the tree, C is habitually much smaller then s . Additionally, n is smaller than

s , specifically in the case of attributes with many values. Hence, OPT-2 is

usually more ecient than OPT in terms of execution time.

Although both OPT and OPT-2 are based on dynamic programming

the two algorithms differ substantially in the way the problem is divided

into sub-problems. The approach adopted in OPT is usually viewed as a

“bottom-up” approach, where the idea is to compute solutions for the sub-

trees rooted at the lowest level of the tree. These solutions are then used to

compute the nodes of the next upper level. This process is repeated until the

root of the initial tree is reached. A basic characteristic of this “bottom-up”

approach is that the trees DT i of the pruning sequence are simultaneously

computed as we advance towards the root. These pruned trees are not final

unless we reach the root. However, once it is reached, the whole sequence

becomes available at once.

The OPT-2 produces the pruned trees one after the other using an

algorithm derivative of a dynamic programming method given by Johnson

and Niemi (1983). Unlike the bottom-up approach of OPT, processing

in OPT-2 is done in a left to right fashion: Given that we have already

computed trees DT 1 , DT 2 ,..., DT i− 1 and that necessary intermediate

results for these computations are kept in memory the OPT-2 algorithm

finds DT i from these through a linear 'left to right' pass over the tree.

In optimal pruning the error of each DT i in the sequence should be a

minimum over all pruned trees of i (or less) leaves. To date, the only two

works that address optimal pruning are Bohanec and Bratko's paper (1994)

and Almuallim's paper (1996).

Search WWH ::

Custom Search

Home