subsamples that, when compared all together to the trees built with 20 subsamples, they
are even more similar than the trees built with 20 subsamples are among themselves.
On the other hand, the trees built using CTC have a larger common structure than the
rest. On average, for any value of Number_Samples, CTC results are better than
C4.5 union results by at least 10%. In the case of C4.5 100, the behaviour is much
worse. Besides, since the values of CTC “range” are larger than the values of CTC, we
can assert that similar structures are reached independently of the value used for the
Number_Samples parameter; that is, even if different subsamples are used to build the
trees, the obtained structures are similar. This keeps the explanation of the
classification steady when the Number_Samples parameter varies. The graphics in Fig. 2
suggest that for the Breast-W database all the trees are identical when Number_Samples
is greater than 40. This does not happen in all databases but, looking at the
tendencies of the average (Fig. 3), we could expect that for each database there exists
a value of Number_Samples with the same property.
The data in Table 3 gave us the idea of studying the number of folds (#folds) where
all the trees converge exactly to the same tree for the different values of
Number_Samples. Focusing the analysis on CTC, we can differentiate three kinds of
behaviour (clusters) among the analysed databases: domains where, for the majority of
folds (#folds ≥ 25, since the total number of folds is 50), all the trees converge to
the same one (Cluster1: Breast-W, Hypo, Lymph, Iris, Voting, Breast-Y); domains with an
intermediate number of folds that converge to the same tree (Cluster2: Heart-C,
Hepatitis, Soybean-L, Heart-H, Sick-E, Credit-A); and domains where, for the analysed
values of Number_Samples, this situation never happens (Cluster3: Credit-G,
Segment210, Glass, Liver, Vehicle, Segment2310, Spam, Faithful). This division shows
that even if the CTC algorithm seems to converge for all the databases, the number of
samples needed to converge is domain-dependent.
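The three-way division described above can be sketched as a small rule. This is only an illustration under stated assumptions: the function name `assign_cluster`, the choice of taking the best #folds count over all analysed Number_Samples values, and the example counts are hypothetical and do not come from the original study; only the majority threshold (25 of 50 folds) is taken from the text.

```python
def assign_cluster(folds_per_setting, total_folds=50):
    """Return 1, 2 or 3 for one database.

    folds_per_setting holds, for each analysed Number_Samples value, the
    number of folds in which all the trees were identical (#folds).
    Assumption: the cluster is decided from the best count reached.
    """
    best = max(folds_per_setting)
    if best * 2 >= total_folds:
        # Majority of folds converge (#folds >= 25 when there are 50 folds).
        return 1
    if best > 0:
        # Some folds converge, but fewer than half of them.
        return 2
    # No fold converges for any analysed Number_Samples value.
    return 3


# Hypothetical #folds counts for Number_Samples = 5, 10, 20, 30, 40, 50.
example_counts = {
    "database_a": [2, 9, 20, 30, 36, 40],
    "database_b": [0, 0, 1, 2, 5, 6],
    "database_c": [0, 0, 0, 0, 0, 0],
}

for name, counts in example_counts.items():
    print(name, "-> Cluster", assign_cluster(counts))
```

Writing the threshold as half of `total_folds` rather than a fixed 25 keeps the rule usable if a different number of runs or folds is evaluated.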
Table 4 shows the results of this analysis for CTC and C4.5 union.
Table 4. Analysis of converging folds (#folds) and %Common (%Com) for CTC and C4.5 union
for different values of Number_Samples (N_S)

                          CTC                                  C4.5 union
N_S           5     10     20     30     40     50       5     10     20     30     40     50
#folds
  Cluster1   1.83   8.00  21.00  31.00  36.00  38.00    0.00   0.00   0.50   1.00   4.67   4.83
  Cluster2   0.00   0.00   0.50   1.67   5.17   5.50    0.00   0.00   0.00   0.00   0.50   0.83
  Cluster3   0.00   0.00   0.00   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00
%Com
  Cluster1  68.33  76.36  82.01  87.44  88.16  91.08   52.61  60.37  68.37  70.39  72.10  73.87
  Cluster2  36.94  44.06  52.99  58.63  61.70  64.08   25.68  28.96  35.29  38.64  42.14  44.28
  Cluster3  19.81  23.36  27.38  30.16  31.16  33.14   16.21  20.31  25.79  28.37  32.08  32.59
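To make the comparison between the two algorithms easier to inspect, the difference in %Common (CTC minus C4.5 union) can be computed per cluster from the Table 4 values. The sketch below is plain arithmetic on those published numbers; the names `COMMON` and `common_gap` are illustrative and are not part of the original work.

```python
# Number_Samples values analysed in Table 4.
N_S = [5, 10, 20, 30, 40, 50]

# %Common values copied from the lower part of Table 4.
COMMON = {
    "Cluster1": {
        "CTC": [68.33, 76.36, 82.01, 87.44, 88.16, 91.08],
        "C4.5 union": [52.61, 60.37, 68.37, 70.39, 72.10, 73.87],
    },
    "Cluster2": {
        "CTC": [36.94, 44.06, 52.99, 58.63, 61.70, 64.08],
        "C4.5 union": [25.68, 28.96, 35.29, 38.64, 42.14, 44.28],
    },
    "Cluster3": {
        "CTC": [19.81, 23.36, 27.38, 30.16, 31.16, 33.14],
        "C4.5 union": [16.21, 20.31, 25.79, 28.37, 32.08, 32.59],
    },
}


def common_gap(cluster):
    """Return CTC minus C4.5 union %Common for each Number_Samples value."""
    rows = COMMON[cluster]
    return [round(ctc - union, 2)
            for ctc, union in zip(rows["CTC"], rows["C4.5 union"])]


for cluster in COMMON:
    gaps = dict(zip(N_S, common_gap(cluster)))
    print(cluster, gaps)
```

Printing the gaps per cluster rather than a single overall average shows how the difference between the two algorithms varies across the three groups of databases.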
When trying to understand the values in the upper part of Table 4 (#folds), it has to
be taken into account that we use a very strict condition to count a unit: all the trees
built for a certain value of Number_Samples have to be identical. For example, if we
look at the data for the Breast-W database in Fig. 2 (results belong to 1 run of 10
folds), when N_S = 20 the values of %Common in the 10 folds are 35.71, 80.00 and 90.91,
and 100.00 for the remaining seven. This means that in seven folds all the compared trees