the error is smaller for CTC than for C4.5. The statistically significant differences
(paired t-test [5], [6]), at the 95% confidence level, are marked in italics. The
differences are statistically significant in 11 databases for C4.5_100 and in 10 databases
for C4.5_union. In the databases where the results for C4.5_100 or C4.5_union are better, the
differences are not statistically significant. The differences with the results of
C4.5_not_resampling are never statistically significant, and the behaviour of CTC is better
on average. We can therefore state that the discriminating capacity of the CTC algorithm is at
least as good as that of C4.5. In this situation, it is worth comparing the
structural stability of the different classifiers: greater structural stability
means that CT trees have better explaining capacity. The data
show that CTs achieve higher structural stability than C4.5_100 (on average 8.46
compared to 3.24) and C4.5_not_resampling (on average 8.46 compared to 5.60).
Looking at the values of Common obtained for C4.5_union, one might say that those trees
achieve higher structural stability than CTC (Common is on average 23.44 compared
to 8.46), but this happens because the complexity of C4.5_union trees is an order of
magnitude larger than that of CTs. In environments where explanation, and
therefore stability, is important, such complex trees are not useful. Moreover, since the
error is smaller for CTC, the principle of parsimony also counts against the
C4.5_union option. More information about this experimentation can be found in [14].
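The per-database comparisons above rely on a paired t-test at the 95% confidence level. As a hedged sketch of how such a comparison works (the helper name and the error values are illustrative, not the paper's), the test statistic on matched per-fold error rates can be computed as:

```python
import math
from statistics import mean, stdev

def paired_t(errors_a, errors_b):
    """Paired t statistic on matched samples, e.g. per-fold error
    rates of two classifiers on the same cross-validation folds."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Illustrative per-fold error rates (percent); not the paper's data.
ctc = [4.1, 5.2, 3.8, 4.9, 5.0]
c45 = [5.0, 5.9, 4.1, 5.6, 5.8]
t = paired_t(ctc, c45)

# The two-sided critical value for df = 4 at the 95% level is about
# 2.776; |t| above it marks the difference as statistically significant.
significant = abs(t) > 2.776
```

Pairing by fold matters: it cancels the fold-to-fold difficulty variation, so the test only measures the per-fold difference between the two classifiers.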
Therefore, we can say that, on average, classification trees induced with the CTC
algorithm have a lower error rate than those induced with C4.5, and they are
structurally steadier. As a consequence, they provide a wider and steadier explanation,
which makes it possible to tackle the excessive sensitivity of classification trees
to resampling methods.
5 Analysis of Convergence
We have observed that the value of Common for CT trees increases with the number
of subsamples used. This means that the CT trees tend to have a larger common
structure when Number_Samples increases. This is a desirable behaviour, but it could
be due to the higher complexity of the trees (this was the case for C4.5_union in the previous
section). In order to take the parsimony principle into account, we have normalised the
Common value with respect to the trees' size (number of internal nodes). We will
call this measure %Common, and it quantifies the fraction of the structure that is
identical across two or more trees.
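A minimal sketch of that normalisation (the function name and signature are illustrative assumptions, not from the paper): divide the shared-node count by the average tree size, so that larger trees do not inflate the score.

```python
def percent_common(common_nodes, tree_sizes):
    """Normalise Common by tree complexity: the number of shared
    internal nodes divided by the average number of internal nodes
    of the compared trees, expressed as a percentage.
    Illustrative sketch, not the paper's exact definition."""
    avg_size = sum(tree_sizes) / len(tree_sizes)
    return 100.0 * common_nodes / avg_size

# Two trees with 10 internal nodes each, sharing 9 of them:
score = percent_common(9, [10, 10])  # 90.0 -> 90% identical structure
```

Under this normalisation, a raw Common of 23.44 over very large trees can yield a smaller %Common than a raw Common of 8.46 over compact trees, which is exactly the parsimony correction motivated above.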
The information in Fig. 2 belongs to one run of the 10-fold cross-validation for the
Breast-W database. The curves represent the values of %Common in each one of the
folds as the Number_Samples parameter varies. Some clues for a better
understanding of the figure: a value of 100% for %Common in a set of trees
means that all the compared trees are identical; a value of 90% means that, on
average, the compared trees have 90% of their structure in common.
Each line in Fig. 2 represents, for the CTC algorithm (left side) and the C4.5 algorithm
(right side), the evolution of %Common in one fold as the number of samples used to build the
trees increases. The number of trees compared in each fold varies with the
Number_Samples parameter. For N_S = 5, 20 trees are compared in each fold and it