are identical. In this case the value of #folds would be 7. The values shown in the
table are averages over the databases belonging to the corresponding cluster,
computed across the 50 folds of the 5 runs. Notice that even when the number of
converging folds is very small, this does not mean that the trees are completely
different: the average common part of the compared trees (lower part of the table,
%Com) remains substantial.
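The following minimal sketch illustrates how these two agreement measures could be computed; the tuple-based tree representation (split attribute, left subtree, right subtree) and the reading of %Com as the structure that matches from the root downwards are assumptions made for illustration, not the paper's exact definitions.

```python
from itertools import combinations

def common_structure(t1, t2):
    # Nodes shared by two trees: split attributes that match from the
    # root downwards (an assumed reading of the %Com measure).
    if t1 is None or t2 is None or t1[0] != t2[0]:
        return 0
    return 1 + common_structure(t1[1], t2[1]) + common_structure(t1[2], t2[2])

def tree_size(t):
    return 0 if t is None else 1 + tree_size(t[1]) + tree_size(t[2])

def fold_statistics(folds):
    # folds: one list of trees per fold (e.g. 50 folds of compared trees).
    # Returns #folds (folds where every tree is identical) and the average
    # percentage of common structure over all pairwise comparisons.
    n_converged, pct_common = 0, []
    for trees in folds:
        if all(t == trees[0] for t in trees[1:]):
            n_converged += 1
        for a, b in combinations(trees, 2):
            denom = max(tree_size(a), tree_size(b))
            if denom:
                pct_common.append(100.0 * common_structure(a, b) / denom)
    avg = sum(pct_common) / len(pct_common) if pct_common else 0.0
    return n_converged, avg
```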
Table 4 shows that the number of converging folds increases with the parameter
Number_Samples for both algorithms. On the other hand, the values obtained for CTC
are always much better than those for C4.5 union in the three clusters. Moreover, in
every database the error of the CT trees is smaller than the error of the C4.5 union
or C4.5 100 trees and, as can be observed in Table 2, most of the domains in Cluster 1
are among the databases where the differences are statistically significant.
The same kind of analysis has been carried out for trees built with the C4.5 100
option. In this case the number of folds where all the trees converge to the same
tree is 0 for every database. The average percentage of common structure (%Common)
is 28% (see Table 3), even lower than the value obtained for CT trees with
Number_Samples = 5 (40%).
Therefore, the CTC algorithm provides a wider and steadier explanation, together
with smaller error rates.
6 Conclusions and Further Work
In order to address the instability that classification trees suffer when small
changes occur in the training set, we have developed a methodology for building
classification trees, the Consolidated Trees' Construction algorithm (CTC), whose
objective is to maintain the explanation without losing accuracy. This paper focuses
on the study of the structural convergence of the algorithm.
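As a rough illustration of the consolidation idea behind CTC, the sketch below grows a single shared tree from several subsamples by voting on each split; best_split and partition are hypothetical stand-ins for C4.5's split selection and data partitioning, and binary splits are assumed for brevity.

```python
from collections import Counter

def build_consolidated_tree(samples, best_split, partition):
    # Each subsample proposes the split C4.5 would make on its own data
    # at this node; None is a vote to turn the node into a leaf.
    votes = Counter(best_split(s) for s in samples)
    attr, _ = votes.most_common(1)[0]
    if attr is None:
        return None  # the majority voted not to split
    # The winning split is forced on *all* subsamples, so every subsample
    # descends through the same node and one common structure emerges.
    left = build_consolidated_tree([partition(s, attr)[0] for s in samples],
                                   best_split, partition)
    right = build_consolidated_tree([partition(s, attr)[1] for s in samples],
                                    best_split, partition)
    return (attr, left, right)
```

Under this scheme the result is a single tree regardless of how many subsamples are used, which is what makes the structural comparisons of the previous sections meaningful.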
The behaviour of the CTC algorithm has been compared to that of C4.5 on twenty
databases: 19 from the UCI Repository and one from a real-world application in our
environment.
The results show that CT trees tend to converge to a single tree as Number_Samples
is increased, and the resulting classification trees also achieve smaller error rates
than C4.5. We can therefore say that this methodology builds structurally steadier
trees, giving stability to the explanation together with a smaller error rate, and
hence an explanation of higher quality. This is essential for some specific domains,
such as medical diagnosis or fraud detection.
The results on structural stability show that the number of samples required to
achieve structural convergence varies depending on the database. We are analysing
the convergence for larger values of the Number_Samples parameter in order to find
the number of samples needed to achieve convergence in each database. In this sense,
the use of different parallelisation techniques (shared-memory and
distributed-memory computers) will be considered due to the increase in
computational cost.
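A minimal sketch of the shared-memory variant of this idea, distributing the per-fold experiments over worker processes with Python's multiprocessing; run_fold is a hypothetical stand-in for one fold's consolidated-tree construction and evaluation.

```python
from multiprocessing import Pool

def run_fold(args):
    fold_id, number_samples = args
    # ... build and evaluate the consolidated tree for this fold ...
    return fold_id, number_samples

if __name__ == "__main__":
    jobs = [(fold, 30) for fold in range(50)]  # 50 folds, Number_Samples = 30
    with Pool() as pool:
        for fold_id, n in pool.map(run_fold, jobs):
            print(f"fold {fold_id} finished (Number_Samples={n})")
```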
Analysing the results obtained for both algorithms with other instantiations of the
Resampling_Mode parameter could also be of interest.