Table 3.1. Description of the databases

Database         Instances   Attributes                 Classes   Missing values
IRIS             150         4 numeric                  3         no
NHL              137         8 numeric and symbolic     2         yes
VOTE             435         16 boolean-valued          2         yes
WEATHER          14          4 numeric and symbolic     2         no
CREDIT-A         690         16 numeric and symbolic    2         yes
TITANIC          750         3 symbolic                 2         no
DIABETES         768         8 numeric                  2         no
HYPOTHYROID      3772        30 numeric and symbolic    4         yes
HEPATITIS        155         19 numeric and symbolic    2         yes
CONTACT-LENSES   24          4 nominal                  3         no
ZOO              101         18 numeric and boolean     7         no
STRAIGHT         320         2 numeric                  2         no
IDS              4950        35 numeric and symbolic    12        no
LYMPH            148         18 numeric                 4         no
BREAST-CANCER    286         9 numeric and symbolic     2         yes
For this experimental comparison, we used the C4.5 algorithm as the weak
learner (following the study of Dietterich [6]). To obtain an unbiased estimate
of the generalization success rate, we used 10-fold cross-validation (following
the study [11]). The databases for our experiments were chosen according to the
principle of diversity: we considered 15 databases from the UCI repository. Some
of these databases are characterized by missing values (NHL, Vote, Hepatitis,
Hypothyroid); others concern the problem of multi-class prediction (Iris: 3
classes, Lymph: 4 classes, Zoo: 7 classes, IDS: 12 classes). We chose the IDS
database [23] especially because it has 35 attributes. Table 3.1 describes the
15 databases used in the experimental comparison.
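The sketch below reproduces an analogous protocol with scikit-learn. It is an
illustration under assumptions, not the implementation used in this study:
scikit-learn provides neither C4.5 nor BrownBoost, so a depth-capped CART
(DecisionTreeClassifier) stands in for the pruned C4.5 weak learner,
AdaBoostClassifier (SAMME) stands in for AdaBoost M1, and the Iris data stands
in for the full database collection.

    # Illustrative sketch: 10-fold cross-validation of a boosted decision tree.
    # Assumptions: CART stands in for C4.5, AdaBoostClassifier for AdaBoost M1.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # IRIS: 150 instances, 4 numeric attributes, 3 classes (cf. Table 3.1)
    X, y = load_iris(return_X_y=True)

    weak_learner = DecisionTreeClassifier(max_depth=3)  # shallow CART in place of pruned C4.5
    booster = AdaBoostClassifier(weak_learner, n_estimators=20)  # 20 boosting iterations

    # 10-fold cross-validation estimate of the generalization error
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(booster, X, y, cv=cv)
    print(f"10-fold CV error rate: {1.0 - scores.mean():.4f}")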
3.4.1 Comparison of Generalization Error
Graphic 1 reports the 10-fold cross-validation error rates of AdaBoost M1,
BrownBoost, and the proposed algorithm. For comparison purposes, we used the
same cross-validation samples for the three algorithms. The results were
obtained by running each algorithm for 20 iterations. The effect of the number
of iterations on the error rates of the three algorithms will be studied in
Section 4.3, where we will consider about 1000 iterations.
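For such a comparison to be meaningful, every algorithm must be trained and
tested on exactly the same folds. The sketch below shows this detail, again
assuming scikit-learn; since BrownBoost and the proposed algorithm have no
scikit-learn implementation, it compares AdaBoost only against its own weak
learner.

    # Evaluate several learners on identical cross-validation folds so that
    # differences in error rate reflect the algorithms, not the data splits.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # A fixed random_state freezes the fold assignment: every learner below
    # is trained and tested on the same ten train/test splits.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    learners = {
        "decision tree alone": DecisionTreeClassifier(max_depth=3),
        "AdaBoost, 20 iterations": AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=3), n_estimators=20),
    }
    for name, clf in learners.items():
        scores = cross_val_score(clf, X, y, cv=cv)
        print(f"{name}: error rate = {1.0 - scores.mean():.4f}")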
The results in Graphic 1 already show that the proposed modifications improve
the error rates of AdaBoost. Indeed, for 14 databases out of 15, the proposed
algorithm achieves an error rate lower than or equal to that of AdaBoost M1. We
also note a significant improvement in the error rates on three databases: NHL,
CONTACT-LENSES, and BREAST-CANCER. For example, the error rate on the
BREAST-CANCER database drops from 45.81% to 30.41%.
 