2. Introduction of attribute noise
Uniform attribute noise [ 100 , 104 ] x% of the values of each attribute in the
data set are corrupted. To corrupt each attribute A i , x% of the examples in
the data set are chosen, and their A i value is replaced by a random value from
the domain D i of the attribute A i . A uniform distribution is used for both
numerical and nominal attributes.
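As a rough sketch of this scheme (the function and parameter names are illustrative, not taken from any particular library), corrupting x% of the examples for each attribute with a uniformly drawn value could look like:

```python
import random

def uniform_attribute_noise(data, noise_level, domains, seed=0):
    """Corrupt a fraction `noise_level` of the examples of each attribute
    with a value drawn uniformly from that attribute's domain.

    `data` is a list of examples (lists of attribute values); `domains`
    gives, per attribute, a (min, max) tuple for numeric attributes or a
    list of categories for nominal ones.  Names are illustrative.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in data]              # work on a copy
    n_corrupt = round(noise_level * len(data))    # x% of the examples
    for i, dom in enumerate(domains):
        for r in rng.sample(range(len(data)), n_corrupt):
            if isinstance(dom, tuple):            # numeric: uniform in [min, max]
                noisy[r][i] = rng.uniform(dom[0], dom[1])
            else:                                 # nominal: uniform over categories
                noisy[r][i] = rng.choice(dom)
    return noisy
```

Note that a fresh set of examples is sampled for each attribute, so different attributes are corrupted in different rows.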
Gaussian attribute noise This scheme is similar to the uniform attribute noise,
but in this case, the A i values are corrupted by adding a random value drawn
from a Gaussian distribution of mean 0 and standard deviation (max − min)/5,
where max and min are the limits of the domain D i of the attribute. Nominal
attributes are treated as in the case of the uniform attribute noise.
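Under the same assumptions as before (illustrative names, a (min, max) tuple per numeric attribute and a category list per nominal one), the Gaussian variant differs only in how numeric values are perturbed; the text does not mention clipping the result back into the domain, so this sketch does not clip:

```python
import random

def gaussian_attribute_noise(data, noise_level, domains, seed=0):
    """Corrupt a fraction `noise_level` of the examples of each attribute.

    Numeric attributes get an additive perturbation drawn from a Gaussian
    of mean 0 and standard deviation (max - min) / 5; nominal attributes
    fall back to the uniform scheme.  Names are illustrative.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in data]
    n_corrupt = round(noise_level * len(data))
    for i, dom in enumerate(domains):
        for r in rng.sample(range(len(data)), n_corrupt):
            if isinstance(dom, tuple):            # numeric: additive Gaussian
                lo, hi = dom
                noisy[r][i] += rng.gauss(0.0, (hi - lo) / 5)
            else:                                 # nominal: uniform over categories
                noisy[r][i] = rng.choice(dom)
    return noisy
```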
In order to create a noisy data set from the original, the noise is introduced into
the training partitions as follows:
1. A level of noise x %, of either class noise (uniform or pairwise) or attribute noise
(uniform or Gaussian), is introduced into a copy of the full original data set.
2. Both data sets, the original and the noisy copy, are partitioned into 5 equal folds,
that is, with the same examples in each one.
3. The training partitions are built from the noisy copy, whereas the test partitions
are formed from examples from the base data set, that is, the noise free data set.
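The index bookkeeping behind the three steps above can be sketched as follows, ignoring stratification for brevity (function and variable names are illustrative). The key point is that the clean data and its noisy copy share the same fold boundaries, so each training set is drawn from the noisy copy and each test set from the clean original:

```python
import random

def noisy_train_clean_test_folds(n_examples, n_folds=5, seed=0):
    """Cut the same `n_folds` folds from the clean data set and its noisy
    copy; return (train_idx, test_idx) pairs where train indices point into
    the noisy copy and test indices into the clean original."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test_idx = folds[k]                       # taken from the clean data
        train_idx = [i for j in range(n_folds)    # taken from the noisy copy
                     if j != k for i in folds[j]]
        splits.append((train_idx, test_idx))
    return splits
```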
We introduce noise, either class or attribute noise, only into the training sets
since we want to focus on the effects of noise on the training process. This is
done by observing how the classifiers built from different noisy training data
for a particular data set behave, in terms of their accuracy, on the same
clean test data. Thus, the accuracy of the classifier built over the original training
set without additional noise acts as a reference value that can be directly compared
with the accuracy of each classifier obtained with the different noisy training data.
Corrupting the test sets as well would also affect the accuracy obtained by the
classifiers, and our conclusions would therefore no longer be limited to the
effects of noise on the training process.
The accuracy of the classifiers on each data set is estimated by means of 5
runs of a stratified 5-fold cross-validation (5-FCV). Hence, a total of 25 results per
data set, noise type and noise level are averaged. Five partitions are used because
each partition then has a large number of examples, so the noise effects are more
notable, facilitating their analysis.
The robustness of each method is estimated with the relative loss of accuracy
(RLA) (Eq. 5.5 ), which is used to measure the percentage of variation of the accuracy
of the classifiers at a given noise level with respect to the original case with no
additional noise:

RLA x % = ( Acc 0% − Acc x % ) / Acc 0% ,     (5.5)
where RLA x % is the relative loss of accuracy at a noise level x %, Acc 0% is the
test accuracy in the original case, that is, with 0% of induced noise, and Acc x % is
the test accuracy with a noise level x %.
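Eq. 5.5 translates directly into code; a minimal sketch (the function name is illustrative):

```python
def rla(acc_0, acc_x):
    """Relative loss of accuracy (Eq. 5.5): the fraction of the noise-free
    test accuracy `acc_0` that is lost at noise level x%, whose test
    accuracy is `acc_x`."""
    return (acc_0 - acc_x) / acc_0
```

A classifier whose accuracy drops from 0.80 on clean training data to 0.72 at some noise level thus has an RLA of 0.1, i.e. it loses 10% of its original accuracy.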
 