binarization along their data complexity measures. The efficacy of filtering is then
assessed by using 1-NN as a classifier and comparing its accuracy with and without
filtering. This comparison is carried out with a Wilcoxon signed-rank test. If
the statistical test finds differences in favour of filtering, the two-class data set is
labeled as appropriate for filtering; otherwise it is labeled as not favorable. As a
result, for each binary data set we have 12 data complexity measures and a label
describing whether the data set is eligible for filtering or not. A simple way to
summarize this information as a rule set is to train a decision tree (C4.5) with the 12
data complexity values as the input features and the appropriateness label as the class.
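The labeling step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the paired samples are taken to be 1-NN accuracies from matching cross-validation folds (the text does not say how the pairs are formed), the test is one-sided, and the p-value uses the normal approximation without tie or continuity correction, which is rough for small samples. The function names and the 0.05 significance level are hypothetical choices, not taken from the original study.

```python
import math

def wilcoxon_greater_pvalue(with_filter, without_filter):
    """One-sided Wilcoxon signed-rank p-value (normal approximation).

    Tests whether the paired values in `with_filter` tend to exceed those
    in `without_filter`. No tie or continuity correction is applied.
    """
    # Zero differences carry no information about direction and are dropped.
    diffs = [a - b for a, b in zip(with_filter, without_filter) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0  # no evidence either way

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # W+ is the sum of ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

def filtering_label(with_filter, without_filter, alpha=0.05):
    """Label a two-class data set from paired 1-NN accuracies (assumed per fold)."""
    p = wilcoxon_greater_pvalue(with_filter, without_filter)
    return "appropriate" if p < alpha else "not favorable"
```

Each labeled data set, together with its 12 complexity values, would then become one training example for the decision tree.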
An important observation about the scheme presented in Fig. 5.5 is that every
label noise filter we want to consider yields a different set of rules. For
the sake of simplicity we limit this illustrative study to the filters selected
in Sect. 5.3: EF, CVCF and IPF.
How accurate is the set of rules when predicting the suitability of label noise
filters? Table 5.3 summarizes the training and test accuracy of C4.5 for each filter,
obtained with a 10-FCV over the data set produced in the fourth step of Fig. 5.5.
The test accuracy is above 80% in all cases, which indicates that the description
obtained by C4.5 is precise enough.
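As a rough illustration of how training and test accuracy are averaged in a 10-FCV, the following sketch implements a generic k-fold loop in plain Python. It is not the authors' experimental code: the helper names are hypothetical, the fold assignment (a seeded shuffle) is an assumption, and the learner shown is a majority-class stand-in rather than C4.5, used only so that the example is self-contained.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy(predict, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def cross_validate(fit, X, y, k=10, seed=0):
    """Return (mean training accuracy, mean test accuracy) over k folds.

    `fit(X_train, y_train)` must return a function that maps one example
    to a predicted label.
    """
    train_acc, test_acc = [], []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in held_out], [y[i] for i in held_out]
        predict = fit(X_tr, y_tr)
        train_acc.append(accuracy(predict, X_tr, y_tr))
        test_acc.append(accuracy(predict, X_te, y_te))
    return sum(train_acc) / k, sum(test_acc) / k

def fit_majority(X_train, y_train):
    """Stand-in learner (NOT C4.5): always predicts the most frequent label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority
```

Replacing `fit_majority` with a function that trains a decision tree on the 12 complexity values would give the kind of training and test figures reported per filter.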
A decision tree is interesting not only for the rule set it generates,
but also because we can check which data complexity measures (that is, the input
attributes) are selected first, and are thus considered more important and
discriminant by C4.5. Averaging the rank of selection of each data complexity measure
over the 10 folds, Table 5.4 shows which complexity measures are the most dis-
Table 5.3 C4.5 accuracy in training and test for the ruleset describing the adequacy of label noise filters

Noise filter    % Acc. training    % Acc. test
EF              0.9948             0.8176
CVCF            0.9966             0.8353
IPF             0.9973             0.8670
Table 5.4 Average rank of each data complexity measure selected by C4.5 (the lower the better)

Metric    EF       CVCF     IPF      Mean
F1        5.90     4.80     4.50     5.07
F2        1.00     1.00     1.00     1.00
F3        10.10    3.40     3.30     5.60
N1        9.10     9.90     7.10     8.70
N2        3.30     2.00     3.00     2.77
N3        7.80     8.50     9.50     8.60
N4        9.90     9.70     10.50    10.03
L1        7.90     10.00    6.00     7.97
L2        9.30     6.80     10.00    8.70
L3        4.60     8.70     5.90     6.40
T1        5.20     6.80     11.00    7.67
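The Mean column of Table 5.4 is the arithmetic mean of the three per-filter ranks. The short sketch below recomputes it from the EF, CVCF and IPF values listed in the table and orders the measures from most to least preferred. The variable names are illustrative only.

```python
# Average selection rank of each measure for EF, CVCF and IPF (Table 5.4).
ranks = {
    "F1": (5.90, 4.80, 4.50),
    "F2": (1.00, 1.00, 1.00),
    "F3": (10.10, 3.40, 3.30),
    "N1": (9.10, 9.90, 7.10),
    "N2": (3.30, 2.00, 3.00),
    "N3": (7.80, 8.50, 9.50),
    "N4": (9.90, 9.70, 10.50),
    "L1": (7.90, 10.00, 6.00),
    "L2": (9.30, 6.80, 10.00),
    "L3": (4.60, 8.70, 5.90),
    "T1": (5.20, 6.80, 11.00),
}

# Mean rank across the three filters (lower means selected earlier by C4.5).
mean_rank = {m: sum(r) / len(r) for m, r in ranks.items()}

# Measures ordered from most to least preferred.
ordered = sorted(mean_rank, key=mean_rank.get)

for m in ordered:
    print(f"{m}: {mean_rank[m]:.2f}")
```

On these values the three measures with the lowest mean rank are F2, N2 and F1, in that order.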
 
 