computed by $1 - r_{ij}$). These outputs are represented by a score matrix $R$:

$$
R = \begin{pmatrix}
- & r_{12} & \cdots & r_{1M} \\
r_{21} & - & \cdots & r_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
r_{M1} & r_{M2} & \cdots & -
\end{pmatrix}
\tag{5.3}
$$
The final output is derived from the score matrix by different aggregation models. The most commonly used and simplest combination, also considered in the experiments of this chapter, is the application of a voting strategy:
$$
\text{Class} = \underset{i=1,\dots,M}{\arg\max} \sum_{1 \le j \ne i \le M} s_{ij}
\tag{5.4}
$$
where $s_{ij}$ is 1 if $r_{ij} > r_{ji}$ and 0 otherwise. Therefore, the class with the largest number of votes is predicted. This strategy has proved to be competitive with different classifiers, obtaining results similar to those of more complex strategies [21].
5.5 Empirical Analysis of Noise Filters and Robust Strategies
In this section we illustrate the advantages of the noise handling approaches described above.
5.5.1 Noise Introduction
In the data sets we are going to use (taken from Chap. 2), as in most real-world data sets, the initial amount and type of noise present is unknown. Therefore, no assumptions about the base noise type and level can be made. For this reason, these data sets are considered to be noise free, in the sense that no recognizable noise has been introduced. In order to control the amount of noise in each data set and check how it affects the classifiers, noise is introduced into each data set in a supervised manner. Four different noise schemes proposed in the literature, as explained in Sect. 5.2, are used in order to introduce a noise level x% into each data set:
1. Introduction of class noise (both schemes are sketched in code after this list).
   - Uniform class noise [84]: x% of the examples are corrupted. The class labels of these examples are randomly replaced by another one from the M classes.
   - Pairwise class noise [100, 102]: let X be the majority class and Y the second majority class; an example with the label X then has a probability of x/100 of being incorrectly labeled as Y.
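As a concrete illustration of the two class noise schemes, here is a minimal sketch of how such corruption can be implemented; the function names, the use of NumPy, and the assumption that the labels form a 1-D integer array are ours, not part of the original experimental setup:

    import numpy as np

    def uniform_class_noise(y, x, n_classes, seed=None):
        # Uniform class noise [84]: corrupt x% of the examples, replacing each
        # selected label with a class drawn at random from the remaining M - 1
        # classes (one common reading of "another one from the M classes").
        rng = np.random.default_rng(seed)
        y = y.copy()
        n_noisy = int(round(len(y) * x / 100.0))
        idx = rng.choice(len(y), size=n_noisy, replace=False)
        for i in idx:
            candidates = [c for c in range(n_classes) if c != y[i]]
            y[i] = rng.choice(candidates)
        return y

    def pairwise_class_noise(y, x, seed=None):
        # Pairwise class noise [100, 102]: each example of the majority class X
        # is relabeled as the second majority class Y with probability x/100.
        rng = np.random.default_rng(seed)
        y = y.copy()
        classes, counts = np.unique(y, return_counts=True)
        order = np.argsort(counts)[::-1]
        X_cls, Y_cls = classes[order[0]], classes[order[1]]
        for i in np.where(y == X_cls)[0]:
            if rng.random() < x / 100.0:
                y[i] = Y_cls
        return y

Both functions leave the original label vector untouched and return a corrupted copy, so the same clean data set can be reused to generate several noise levels x.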
 