Consolidated Trees: An Analysis of Structural Convergence - Data Mining: Theory, Methodology, Techniques, and Applications


subsamples that, when compared with the trees built with 20 subsamples taken all together,

are even more similar to them than the trees built with 20 subsamples are among themselves.

On the other hand, the trees built using CTC share a larger common structure than the
rest. On average, for any value of Number_Samples, CTC results are at least 10% better
than C4.5 union results. In the case of C4.5 100, the behaviour is much worse. Moreover,
since the values of CTC “range” are larger than the values of CTC, we can assert that
similar structures are reached independently of the value used for the Number_Samples
parameter; that is, even if different subsamples are used to build the trees, the
obtained structures are similar. This keeps the explanation of the classification
steady when the Number_Samples parameter varies. The graphics in Fig. 2 suggest that,
for the Breast-W database, all the trees are identical when Number_Samples is greater
than 40. This does not happen in all databases but, looking at the tendencies of the
average (Fig. 3), we could expect that for each database there exists a value of
Number_Samples with the same property.

The data in Table 3 suggested studying the number of folds (#folds) in which all the
trees converge to exactly the same tree, for the different values of Number_Samples.
Focusing the analysis on CTC, we can distinguish three kinds of behaviour (clusters)
among the analysed databases: domains where all the trees converge to the same one in
the majority of folds (#folds ≥ 25, since the total number of folds is 50) (Cluster1:
Breast-W, Hypo, Lymph, Iris, Voting, Breast-Y); domains with an intermediate number of
folds that converge to the same tree (Cluster2: Heart-C, Hepatitis, Soybean-L, Heart-H,
Sick-E, Credit-A); and domains where this situation never happens for the analysed
values of Number_Samples (Cluster3: Credit-G, Segment210, Glass, Liver, Vehicle,
Segment2310, Spam, Faithful). This division shows that, even if the CTC algorithm seems
to converge for all the databases, the number of samples needed to converge is domain
dependent.
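
The clustering rule described above can be illustrated with a minimal sketch. It rests on several assumptions that the text does not state explicitly: a fold is taken as converged when its %Common equals 100, a database is placed in Cluster1 when the best #folds over the analysed Number_Samples values reaches at least half of the 50 folds, in Cluster3 when it is always zero, and in Cluster2 otherwise. The function names and the example counts are hypothetical.

```python
import math
from typing import Dict, List

TOTAL_FOLDS = 50  # total number of folds stated in the text


def count_converged_folds(percent_common: List[float]) -> int:
    """Count folds in which %Common is 100, i.e. all compared trees coincide.

    Assumption: a %Common of 100 is what marks a converged fold.
    """
    return sum(1 for value in percent_common if math.isclose(value, 100.0))


def assign_cluster(folds_by_n_s: Dict[int, int],
                   total_folds: int = TOTAL_FOLDS) -> str:
    """Map the #folds observed for each Number_Samples value to a cluster.

    Assumption: the decision uses the largest #folds over all analysed
    Number_Samples values; the exact criterion is not given in the text.
    """
    best = max(folds_by_n_s.values())
    if best >= total_folds / 2:   # majority of folds converge
        return "Cluster1"
    if best > 0:                  # some folds converge, but not a majority
        return "Cluster2"
    return "Cluster3"             # no fold ever converges


if __name__ == "__main__":
    # Hypothetical #folds per Number_Samples value for one database.
    example = {5: 0, 20: 12, 40: 27}
    print(assign_cluster(example))
```

The thresholds are kept as plain comparisons so that a different reading of "intermediate" can be tried by changing a single line.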

Table 4 shows the results of this analysis for CTC and C4.5 union.

Table 4. Analysis of converging folds (#folds) and %Common (%Com) for CTC and C4.5 union for different values of Number_Samples (N_S)

CTC

C4.5 union

N_S

Cluster1

1.83

8.00 21.00

31.00

36.00

38.00

0.00

0.50

1.00 4.67 4.83

0.00 0.00 0.50

1.67 5.17 5.50

0.00

0.00 0.50 0.83

Cluster2

Cluster3

0.00 0.00 0.00

0.00

0.00 0.00 0.00

%Com       CTC                                          C4.5 union
Cluster1   68.33  76.36  82.01  87.44  88.16  91.08     52.61  60.37  68.37  70.39  72.10  73.87
Cluster2   36.94  44.06  52.99  58.63  61.70  64.08     25.68  28.96  35.29  38.64  42.14  44.28
Cluster3   19.81  23.36  27.38  30.16  31.16  33.14     16.21  20.31  25.79  28.37  32.08  32.59
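
As a quick arithmetic check on the %Com rows of Table 4, the gap between CTC and C4.5 union can be computed for each cluster. The sketch below assumes that the twelve values in each row split into six CTC values followed by six C4.5 union values, listed in the same N_S order; the N_S values themselves are not used, and all names are illustrative.

```python
# %Com rows of Table 4, read as (CTC values, C4.5 union values) per cluster.
# Assumption: the first six numbers of each row belong to CTC and the last
# six to C4.5 union, column by column for the same N_S.
PERCENT_COMMON = {
    "Cluster1": ([68.33, 76.36, 82.01, 87.44, 88.16, 91.08],
                 [52.61, 60.37, 68.37, 70.39, 72.10, 73.87]),
    "Cluster2": ([36.94, 44.06, 52.99, 58.63, 61.70, 64.08],
                 [25.68, 28.96, 35.29, 38.64, 42.14, 44.28]),
    "Cluster3": ([19.81, 23.36, 27.38, 30.16, 31.16, 33.14],
                 [16.21, 20.31, 25.79, 28.37, 32.08, 32.59]),
}


def gaps_per_cluster(table):
    """Return, per cluster, CTC minus C4.5 union for each N_S column."""
    return {
        cluster: [ctc - c45 for ctc, c45 in zip(ctc_row, c45_row)]
        for cluster, (ctc_row, c45_row) in table.items()
    }


gaps = gaps_per_cluster(PERCENT_COMMON)
for cluster, values in gaps.items():
    mean_gap = sum(values) / len(values)
    print(f"{cluster}: mean gap = {mean_gap:.2f} percentage points")
```

Under that reading of the columns, the gap in %Com is widest for Cluster1 and Cluster2 and much smaller for Cluster3.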

To interpret the values in the upper part of Table 4 (#folds), it has to be taken into
account that the condition for counting one unit is very strict: all the trees built
for a certain value of Number_Samples have to be identical. For example, in the data
for the Breast-W database in Fig. 2 (results for 1 run of 10 folds), when N_S = 20 the
values of %Common in the 10 folds are 35.71, 80.00 and 90.91, and 100.00 for the
remaining seven. This means that in seven folds all the compared trees
