Consolidated Trees: An Analysis of Structural Convergence - Data Mining: Theory, Methodology, Techniques, and Applications


systems in a similar zone in the learning curve [11], [19]. We must not forget that
growing a classification tree too far increases the probability of overtraining.

The validation methodology used in this experimentation was 5 runs of 10-fold
stratified cross-validation [11]. In each fold of the cross-validation, 100 stratified
subsamples were extracted, always without replacement and with a size of 75% of the
training sample of the corresponding fold. These subsamples were used to build both
kinds of trees, CT and C4.5.
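The subsampling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stratified_subsample`, the toy class distribution and the use of `round` for the per-class quota are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, rng):
    """Return sorted indices of a stratified subsample drawn without replacement.

    Hypothetical helper: each class contributes round(fraction * class_size)
    instances, so class proportions are approximately preserved.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        quota = round(fraction * len(indices))
        # random.Random.sample draws without replacement.
        chosen.extend(rng.sample(indices, quota))
    return sorted(chosen)


# Toy training fold (assumed data): 40 instances of class "a", 20 of class "b".
fold_labels = ["a"] * 40 + ["b"] * 20
rng = random.Random(0)

# 100 subsamples per fold, each with 75% of the training sample.
subsamples = [stratified_subsample(fold_labels, 0.75, rng) for _ in range(100)]
```

Each subsample is drawn independently of the others, so two subsamples may share instances, while no instance is repeated inside a single subsample.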

For the CTC algorithm, the subsamples have been used in disjoint groups to build the
trees, which leads to a different number of CTs for each value of the
Number_Samples (N_S) parameter: N_S = 5 (20 trees), N_S = 10 (10 trees), N_S = 20
(5 trees), N_S = 30 (3 trees), N_S = 40 (2 trees) and N_S = 50 (2 trees). This means
that, for each fold, 42 Consolidated Trees have been built.
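The tree counts above follow from integer division of the 100 subsamples by N_S. The sketch below reproduces them under two assumptions that the text does not state explicitly: that subsamples are grouped consecutively, and that leftover subsamples (for example 10 of them when N_S = 30) are simply not used. The function name `consolidated_tree_groups` is hypothetical.

```python
def consolidated_tree_groups(n_subsamples, n_s):
    """Split subsample indices into disjoint groups of size n_s.

    Assumption: groups are consecutive and leftover subsamples are dropped.
    Each group would be used to build one Consolidated Tree.
    """
    n_trees = n_subsamples // n_s
    return [list(range(i * n_s, (i + 1) * n_s)) for i in range(n_trees)]


N_S_VALUES = [5, 10, 20, 30, 40, 50]

trees_per_setting = {
    n_s: len(consolidated_tree_groups(100, n_s)) for n_s in N_S_VALUES
}
total_cts_per_fold = sum(trees_per_setting.values())

print(trees_per_setting)   # {5: 20, 10: 10, 20: 5, 30: 3, 40: 2, 50: 2}
print(total_cts_per_fold)  # 42
```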

For the C4.5 algorithm, three options have been tried:

• C4.5_100 consists of building a tree with each one of the 100 subsamples mentioned
before, generated by undersampling the training set (fold). The amount of information
of the original training set used by each algorithm is different in this case: a CT
sees more information than a C4.5 tree, which can lead to differences in accuracy.
This has led us to design another comparison, where both algorithms use the same
information (C4.5_union).

• The sample used to induce each one of the C4.5_union trees is the union of the
subsamples used to build the corresponding CT, so in this experimentation the
information handled by both algorithms is the same. In this case as many C4.5
trees as CTs are built.

• Related to the previous one, we made a third comparison between C4.5 and the CTC
algorithm in which the C4.5 trees have been built directly from the training data
belonging to each fold of the 10-fold cross-validation (C4.5_not_resampling). We must
not forget that this option cannot be used when resampling is required. However, we
think the comparison is useful for putting the achieved error rates in context.
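The union sample used in the second option can be illustrated with a small sketch. It assumes that subsamples are represented as lists of training-instance indices and that the union keeps each instance once; whether repeated instances are kept is not stated in the text. The name `union_sample` and the toy data are hypothetical.

```python
def union_sample(subsamples, group):
    """Indices of all training instances present in at least one subsample of the group.

    Assumption: the union is a set union, so an instance that appears in
    several subsamples is included only once.
    """
    merged = set()
    for i in group:
        merged.update(subsamples[i])
    return sorted(merged)


# Toy data (assumed): four subsamples over training instances 0..7.
subsamples = [[0, 1, 2], [1, 2, 3], [4, 5, 6], [0, 6, 7]]

# The CT built from subsamples 0, 1 and 3 is paired with a C4.5 tree
# induced from the union of those same subsamples.
group = [0, 1, 3]
print(union_sample(subsamples, group))  # [0, 1, 2, 3, 6, 7]
```

Because the subsamples overlap, the union is smaller than the sum of their sizes, but it covers exactly the instances seen by the corresponding CT.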

The number of C4.5 trees generated is larger than the number of CTs. In each fold we
have generated 100 C4.5_100 trees, 42 C4.5_union trees (the same number as CTs) and
one C4.5_not_resampling tree.

With this information we can quantify the number of trees generated for the wide
experimentation described in this section. For each of the 20 databases, 5 runs of 10
folds have been generated (1,000 folds in total), so for the CTC algorithm 42,000
trees have been built, and for the C4.5 algorithm 100,000 (C4.5_100) + 42,000
(C4.5_union) + 1,000 (C4.5_not_resampling, one per fold) = 143,000 trees.

4 Summary of Previous Work

This section presents the results of the different comparisons made between the two
algorithms (C4.5 and CTC).

The analysis has been made from two points of view: error and structural stability.
In order to evaluate structural stability, a structural distance between the trees
being compared has been defined: Common. This structural measure is based on a
pairwise comparison, Similarity, among all the trees of the set. This function
