Dealing with Noisy Data - Data Preprocessing in Data Mining - page 128

binarization along their data complexity measures. The filtering efficacy is then assessed by using 1-NN as the classifier and comparing the accuracy obtained with filtering against the accuracy obtained without it. The comparison is carried out with a Wilcoxon signed-rank test. If the test finds differences favouring filtering, the two-class data set is labeled as appropriate for filtering; otherwise it is labeled as not appropriate. As a result, each binary data set is described by 12 data complexity measures and a label stating whether it is eligible for filtering. A simple way to summarize this information as a rule set is to train a decision tree (C4.5) with the 12 data complexity values as input features and the appropriateness label as the class.
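
The labeling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the paired samples are per-fold 1-NN accuracies, uses a one-sided exact Wilcoxon signed-rank test, and takes 0.05 as the significance level. The function names and the "filter" / "no-filter" labels are hypothetical. Building the C4.5 tree on the resulting labels is not covered here.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_greater(with_filter, without_filter):
    """Exact one-sided p-value for H1: with_filter tends to exceed without_filter."""
    # Zero differences carry no information about direction and are dropped.
    diffs = [a - b for a, b in zip(with_filter, without_filter) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    # Double the ranks so that tied (x.5) ranks become integers.
    doubled = [int(round(2 * r)) for r in ranks]
    w_plus = sum(r for r, d in zip(doubled, diffs) if d > 0)
    # Count, for every achievable rank sum, how many sign assignments give it.
    counts = Counter({0: 1})
    for r in doubled:
        updated = Counter(counts)
        for total, ways in counts.items():
            updated[total + r] += ways
        counts = updated
    favourable = sum(ways for total, ways in counts.items() if total >= w_plus)
    return favourable / 2 ** len(doubled)


def label_data_set(with_filter, without_filter, alpha=0.05):
    """Assumed labeling rule: 'filter' only if filtering is significantly better."""
    p_value = wilcoxon_greater(with_filter, without_filter)
    return "filter" if p_value < alpha else "no-filter"


# Hypothetical per-fold 1-NN accuracies for one binary data set.
acc_with_filter = [0.86, 0.88, 0.84, 0.90, 0.87]
acc_without_filter = [0.85, 0.85, 0.80, 0.85, 0.80]
print(label_data_set(acc_with_filter, acc_without_filter))
```

In this made-up example filtering wins in all five folds, so the exact one-sided p-value is 1/32 and the data set is labeled "filter". With real accuracies that tie in magnitude, floating-point differences may need rounding before ranking.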

An important observation about the scheme presented in Fig. 5.5 is that each label noise filter we consider yields a different set of rules. For the sake of simplicity, we limit this illustrative study to the filters selected in Sect. 5.3: EF, CVCF and IPF.

How accurate is the set of rules when predicting the suitability of label noise filters? Using 10-fold cross-validation (10-FCV) over the data set obtained in the fourth step of Fig. 5.5, the training and test accuracy of C4.5 for each filter is summarized in Table 5.3. The test accuracy is above 80% for all three filters (0.8176 for EF, 0.8353 for CVCF and 0.8670 for IPF), which indicates that the description obtained by C4.5 is precise enough.
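
For readers who want to see how training and test accuracy are obtained under 10-FCV, here is a minimal sketch. It is not the experimental code behind Table 5.3: the helper names are hypothetical, the learner is passed in as a pair of `fit` / `predict` callables, and a majority-class stand-in replaces C4.5 so that the example stays self-contained. It assumes at least as many samples as folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices once and deal them into k near-equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Return (mean training accuracy, mean test accuracy) over k folds."""
    def accuracy(model, rows):
        hits = sum(predict(model, X[i]) == y[i] for i in rows)
        return hits / len(rows)

    train_scores, test_scores = [], []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        train_scores.append(accuracy(model, train))
        test_scores.append(accuracy(model, test))
    return sum(train_scores) / k, sum(test_scores) / k


# Stand-in learner (NOT C4.5): always predicts the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Hypothetical meta data set: one row per binary data set, one label each.
X = [[float(i)] for i in range(20)]
y = ["filter"] * 15 + ["no-filter"] * 5
train_acc, test_acc = cross_validate(fit_majority, predict_majority, X, y)
print(train_acc, test_acc)
```

Reporting both averages, as Table 5.3 does, makes the gap between fitting the training folds and generalizing to the held-out fold visible.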

A decision tree is of interest not only for the rule set it generates, but also because we can check which data complexity measures (that is, the input attributes) are selected first and are therefore considered more important and discriminant by C4.5. Table 5.4 averages the selection rank of each data complexity measure over the 10 folds and shows which complexity measures are the most dis-

Table 5.3 C4.5 accuracy in training and test for the ruleset describing the adequacy of label noise filters

Noise filter    Acc. training    Acc. test
EF              0.9948           0.8176
CVCF            0.9966           0.8353
IPF             0.9973           0.8670

Table 5.4 Average rank of each data complexity measure selected by C4.5 (the lower the better)

Metric    EF       CVCF     IPF      Mean
F1         5.90     4.80     4.50     5.07
F2         1.00     1.00     1.00     1.00
F3        10.10     3.40     3.30     5.60
N1         9.10     9.90     7.10     8.70
N2         3.30     2.00     3.00     2.77
N3         7.80     8.50     9.50     8.60
N4         9.90     9.70    10.50    10.03
L1         7.90    10.00     6.00     7.97
L2         9.30     6.80    10.00     8.70
L3         4.60     8.70     5.90     6.40
T1         5.20     6.80    11.00     7.67
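
The Mean column of Table 5.4 is the arithmetic mean of the three per-filter ranks. The short sketch below recomputes it from the rows reproduced on this page and sorts the measures by mean rank. The variable names are illustrative only, and only the rows listed above are included.

```python
# Per-filter average ranks copied from Table 5.4: (EF, CVCF, IPF).
ranks = {
    "F1": (5.90, 4.80, 4.50),
    "F2": (1.00, 1.00, 1.00),
    "F3": (10.10, 3.40, 3.30),
    "N1": (9.10, 9.90, 7.10),
    "N2": (3.30, 2.00, 3.00),
    "N3": (7.80, 8.50, 9.50),
    "N4": (9.90, 9.70, 10.50),
    "L1": (7.90, 10.00, 6.00),
    "L2": (9.30, 6.80, 10.00),
    "L3": (4.60, 8.70, 5.90),
    "T1": (5.20, 6.80, 11.00),
}

# Mean column as printed in Table 5.4, kept for comparison.
table_mean = {
    "F1": 5.07, "F2": 1.00, "F3": 5.60, "N1": 8.70, "N2": 2.77, "N3": 8.60,
    "N4": 10.03, "L1": 7.97, "L2": 8.70, "L3": 6.40, "T1": 7.67,
}

# Recompute the mean rank of each measure across the three filters.
mean_rank = {m: round(sum(v) / len(v), 2) for m, v in ranks.items()}

# Lower mean rank means the measure is chosen earlier by the tree.
ordering = sorted(mean_rank, key=mean_rank.get)

for measure in ordering:
    print(f"{measure}: {mean_rank[measure]:.2f}")
```

The recomputed means agree with the printed column, and the three lowest mean ranks among these rows belong to F2, N2 and F1.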
