Consolidated Trees: An Analysis of Structural Convergence - Data Mining: Theory, Methodology, Techniques, and Applications

Database Reference

In-Depth Information

The different subsamples are used to make proposals about the feature that should

be used to split in the current node. The split function used in this work is the gain

ratio criterion (the same used by Quinlan in C4.5). The decision about which feature

will be used to make the split in a node of the Consolidated Tree (CT) is accorded

among the different proposals. The decision is made by a voting process node by

node. Based on this decision, all the subsamples are divided using the same feature.

The iterative process is described in Algorithm 1.

The algorithm starts extracting a set of subsamples ( Number_Samples ) from the

original training set. The subsamples can be obtained based on the desired resampling

technique ( Resampling_Mode ).

Decision tree's construction algorithms, usually divide the initial sample in several

data partitions. In our algorithm, LS i contains the data partitions created from each

subsample S i .

Algorithm 1. Consolidated Trees' Construction Algorithm (CTC)

Generate Number_Samples subsamples ( S i ) from S with Resampling_Mode method.

CurrentNode := RootNode

for i := 1 to Number_Samples

LS i := { S i }

end for

repeat

for i := 1 to Number_Samples

CurrentS i := First ( LS i )

LS i := LS i - CurrentS i

Induce the best split (X,B) i for CurrentS i

end for

Obtain the consolidated pair ( X c ,B c ), based on (X,B) i , 1 ≤ i ≤ Number_Samples

if ( X c ,B c ) ≠ Not_Split

Split CurrentNode based on ( X c ,B c )

for i := 1 to Number_Samples

Divide CurrentS i based on ( X c ,B c ) to obtain n subsamples { S 1 i , … S n i }

LS i := { S 1 i , … S n i } + LS i

end for

else consolidate CurrentNode as a leaf

end if

Decision in

each subsample

Force in all

subsamples

(Consolidate)

CurrentNode := NextNode

until

i, LS i is empty

∀

In the algorithm, the consolidation of a node is divided in two main parts. The first

one where a separate analysis is done in each of the subsamples, and the second one,

where based on all the proposals, a decision to consolidate the node is made. At this

point, all the subsamples are forced to make the same split.

The pair (X,B) i is the split proposal for the first partition in LS i . X is the feature

selected to split and B indicates the proposed branches or criteria to divide the data in

the current node. X c is the feature obtained by a voting process among all the

Data Mining: Theory, Methodology, Techniques, and Applications

Search WWH ::

Custom Search

Home