are identical. In this case the value of #folds would be 7. The values shown in the
table are averages over the databases belonging to the corresponding cluster,
computed across the 50 folds of the 5 runs. Notice that even when the number of
converging folds is very small, this does not mean that the trees are completely
different: the average common part of the compared trees (lower part of the table,
%Com) remains substantial.
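The following minimal sketch illustrates how these two agreement measures could be computed; the tuple-based tree representation (split attribute, left subtree, right subtree) and the reading of %Com as the structure that matches from the root downwards are assumptions made for illustration, not the paper's exact definitions.

```python
from itertools import combinations

def common_structure(t1, t2):
    # Nodes shared by two trees: split attributes that match from the
    # root downwards (an assumed reading of the %Com measure).
    if t1 is None or t2 is None or t1[0] != t2[0]:
        return 0
    return 1 + common_structure(t1[1], t2[1]) + common_structure(t1[2], t2[2])

def tree_size(t):
    return 0 if t is None else 1 + tree_size(t[1]) + tree_size(t[2])

def fold_statistics(folds):
    # folds: one list of trees per fold (e.g. 50 folds of compared trees).
    # Returns #folds (folds where every tree is identical) and the average
    # percentage of common structure over all pairwise comparisons.
    n_converged, pct_common = 0, []
    for trees in folds:
        if all(t == trees[0] for t in trees[1:]):
            n_converged += 1
        for a, b in combinations(trees, 2):
            denom = max(tree_size(a), tree_size(b))
            if denom:
                pct_common.append(100.0 * common_structure(a, b) / denom)
    avg = sum(pct_common) / len(pct_common) if pct_common else 0.0
    return n_converged, avg
```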
Table 4 shows that the number of converging folds increases with the parameter
Number_Samples for both algorithms. On the other hand, the values obtained for CTC
are always much better than those for C4.5 union in the three clusters. Moreover, in
every database the error of the CT trees is smaller than the error of the C4.5 union
or C4.5 100 trees and, as can be observed in Table 2, most of the domains in Cluster 1
are among the databases where the differences are statistically significant.
The same kind of analysis has been carried out for trees built with the C4.5 100
option. In this case the number of folds where all the trees converge to the same
tree is 0 for every database. The average percentage of common structure (%Common)
is 28% (see Table 3), even lower than the value obtained for CT trees with
Number_Samples = 5 (40%).
Therefore, the CTC algorithm provides a wider and steadier explanation, together
with smaller error rates.
6 Conclusions and Further Work
In order to address the instability that classification trees suffer when small
changes occur in the training set, we have developed a methodology for building
classification trees, the Consolidated Trees' Construction algorithm (CTC), whose
objective is to maintain the explanation without losing accuracy. This paper focuses
on the study of the structural convergence of the algorithm.
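As a rough illustration of the consolidation idea behind CTC, the sketch below grows a single shared tree from several subsamples by voting on each split; best_split and partition are hypothetical stand-ins for C4.5's split selection and data partitioning, and binary splits are assumed for brevity.

```python
from collections import Counter

def build_consolidated_tree(samples, best_split, partition):
    # Each subsample proposes the split C4.5 would make on its own data
    # at this node; None is a vote to turn the node into a leaf.
    votes = Counter(best_split(s) for s in samples)
    attr, _ = votes.most_common(1)[0]
    if attr is None:
        return None  # the majority voted not to split
    # The winning split is forced on *all* subsamples, so every subsample
    # descends through the same node and one common structure emerges.
    left = build_consolidated_tree([partition(s, attr)[0] for s in samples],
                                   best_split, partition)
    right = build_consolidated_tree([partition(s, attr)[1] for s in samples],
                                    best_split, partition)
    return (attr, left, right)
```

Under this scheme the result is a single tree regardless of how many subsamples are used, which is what makes the structural comparisons of the previous sections meaningful.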
The behaviour of the CTC algorithm has been compared to that of C4.5 on twenty
databases: 19 from the UCI Repository and one from a real-world application in our
environment.
The results show that CT trees tend to converge to a single tree as Number_Samples
is increased, and the resulting classification trees also achieve smaller error rates
than C4.5. We can therefore say that this methodology builds structurally steadier
trees, giving stability to the explanation together with a smaller error rate, and
hence an explanation of higher quality. This is essential for some specific domains,
such as medical diagnosis or fraud detection.
The results on structural stability show that the number of samples required to
achieve structural convergence varies depending on the database. We are analysing
the convergence for larger values of the Number_Samples parameter in order to find
the number of samples needed to achieve convergence in each database. In this sense,
the use of different parallelisation techniques (shared-memory and
distributed-memory computers) will be considered due to the increase in
computational cost.
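A minimal sketch of the shared-memory variant of this idea, distributing the per-fold experiments over worker processes with Python's multiprocessing; run_fold is a hypothetical stand-in for one fold's consolidated-tree construction and evaluation.

```python
from multiprocessing import Pool

def run_fold(args):
    fold_id, number_samples = args
    # ... build and evaluate the consolidated tree for this fold ...
    return fold_id, number_samples

if __name__ == "__main__":
    jobs = [(fold, 30) for fold in range(50)]  # 50 folds, Number_Samples = 30
    with Pool() as pool:
        for fold_id, n in pool.map(run_fold, jobs):
            print(f"fold {fold_id} finished (Number_Samples={n})")
```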
Analysing the results obtained for both algorithms with other instantiations of the
Resampling_Mode parameter could also be of interest.