systems in a similar zone of the learning curve [11],[19]. We must not forget that
growing a classification tree too far increases the probability of overtraining.
The validation methodology used in this experimentation was to run a 10-fold
stratified cross-validation 5 times [11]. In each fold of the cross-validation,
100 stratified subsamples were extracted, always without replacement and with a
size of 75% of the training sample of the corresponding fold. These subsamples were
used to build both kinds of trees, CT and C4.5.
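The subsampling step can be illustrated with a minimal sketch. It assumes that "stratified" means each class keeps its proportion, obtained here by rounding 75% of each class count; the function names, the seed and the toy labels are hypothetical and are not taken from the original experimentation.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, rng):
    """Draw a stratified subsample of instance indices, without replacement."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        # Assumption: keep round(fraction * class size) instances of each class.
        k = round(fraction * len(members))
        # random.Random.sample never repeats an element within one draw.
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)

def make_subsamples(labels, n_subsamples=100, fraction=0.75, seed=0):
    """Build the subsamples of one training fold (100 of size 75% by default)."""
    rng = random.Random(seed)
    return [stratified_subsample(labels, fraction, rng)
            for _ in range(n_subsamples)]

# Toy training fold (hypothetical): 80 instances of class "a", 20 of class "b".
fold_labels = ["a"] * 80 + ["b"] * 20
subsamples = make_subsamples(fold_labels)
```

Each subsample is drawn independently from the same training fold, so different subsamples may share instances; only the draw within a single subsample is without replacement.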
For the CTC algorithm, the subsamples have been used disjointly to build the trees,
which yields a different number of CTs for each value of the
Number_Samples (N_S) parameter: N_S = 5 (20 trees), N_S = 10 (10 trees), N_S = 20
(5 trees), N_S = 30 (3 trees), N_S = 40 (2 trees) and N_S = 50 (2 trees). This means
that, for each fold, 42 Consolidated Trees have been built.
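The tree counts listed above are consistent with integer division of the 100 subsamples by N_S. The sketch below reproduces them under two assumptions that the text does not state: that groups are formed from consecutive subsamples, and that leftover subsamples (for example 10 of them when N_S = 30) are simply not used. The names are hypothetical.

```python
TOTAL_SUBSAMPLES = 100
N_S_VALUES = [5, 10, 20, 30, 40, 50]

def disjoint_groups(subsample_ids, n_s):
    """Split the ids into non-overlapping groups of size n_s.

    Assumption: consecutive ids are grouped, and any remainder is dropped.
    """
    n_trees = len(subsample_ids) // n_s
    return [subsample_ids[i * n_s:(i + 1) * n_s] for i in range(n_trees)]

ids = list(range(TOTAL_SUBSAMPLES))

# One Consolidated Tree is built from each group of N_S subsamples.
trees_per_setting = {n_s: len(disjoint_groups(ids, n_s)) for n_s in N_S_VALUES}
total_cts_per_fold = sum(trees_per_setting.values())

print(trees_per_setting)   # {5: 20, 10: 10, 20: 5, 30: 3, 40: 2, 50: 2}
print(total_cts_per_fold)  # 42
```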
For the C4.5 algorithm, different options have been tried:
C4.5 100 consists of building a tree with each of the 100 subsamples mentioned
above, generated by undersampling the training set (fold). The amount of information
from the original training set used by each algorithm differs in this case: a CT
sees more information than a C4.5 tree, which can lead to differences in accuracy.
This has led us to design another comparison in which both algorithms use the same
information (C4.5 union).
The sample used to induce each C4.5 union tree is the union of the
subsamples used to build the corresponding CT, so in this experimentation the
information handled by both algorithms is the same. In this case, as many C4.5
trees as CTs are built.
Related to the previous one, we made a third comparison between C4.5 and the CTC
algorithm, in which the C4.5 trees have been built directly from the training data
belonging to each fold of the 10-fold cross-validation (C4.5 not resampling). We must not
forget that this option cannot be used when resampling is required. However, we
think the comparison is useful for putting the achieved error rates in context.
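As an illustration of the C4.5 union option, the sketch below assembles one training sample from the group of subsamples behind a single CT. It assumes that "union" means the set union of instance indices, so an instance that appears in several subsamples is kept once; the text does not say whether repeated instances are kept, and the function name and toy indices are hypothetical.

```python
def union_sample(subsample_group):
    """Merge the subsamples of one CT into a single training sample.

    Assumption: set union of instance indices (duplicates kept only once).
    """
    merged = set()
    for subsample in subsample_group:
        merged.update(subsample)
    return sorted(merged)

# Toy group (hypothetical): three subsamples given as instance indices.
group = [[0, 1, 2, 5], [1, 2, 3, 7], [0, 3, 8, 9]]
c45_union_indices = union_sample(group)
print(c45_union_indices)  # [0, 1, 2, 3, 5, 7, 8, 9]
```

Under this assumption the C4.5 union tree is induced from every instance that the corresponding CT could see.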
The number of C4.5 trees generated is larger than the number of CTs. In each fold
we have generated 100 C4.5 100 trees, 42 C4.5 union trees (the same number as CTs) and one
C4.5 not resampling tree.
With this information we can quantify the number of trees generated for the extensive
experimentation described in this section. For each of the 20 databases, 5 runs of 10
folds have been generated; therefore, for the CTC algorithm, 42,000 trees have been built, and
for the C4.5 algorithm, 100,000 (C4.5 100) + 42,000 (C4.5 union) + 100 (C4.5 not resampling) =
142,100 trees.
4 Summary of Previous Work
This section is devoted to presenting the results of the different comparisons made between
the two algorithms (C4.5 and CTC).
The analysis has been made from two points of view: error and structural stability.
In order to evaluate structural stability, a structural distance between the trees
being compared has been defined: Common. This structural measure is based on a
pairwise comparison, Similarity, among all the trees of the set. This function