Dealing with Noisy Data - Data Preprocessing in Data Mining


from real valued ones with respect to nominal attributes. However we must point out

that in attribute noise the probability dependencies are not the only important aspect

to be considered. The probability distribution of the noise is also fundamental.

For numerical data, the noisy datum $\hat{x}_i$ may be a slight variation of the true value $x_i$

or a completely random value. The density function of the noise values is very rarely

known. A simple example of the first type of noise is a perturbation drawn from a

normal distribution centered on the true value with a fixed variance.

The second type of noise is usually modeled by assigning a uniform probability

to all the possible values of the input feature's range. This procedure is also typical

with nominal data, where no value is preferred over any other. Again note that the

distribution of the noise is not the same as the probability of its appearance discussed

above: first the noise must be introduced with a certain probability (following the

NCAR, NAR or NNAR models) and then the noise value is stated or analyzed to

follow the aforementioned density functions.
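
The two-step process described above (first decide whether a value is corrupted, then draw the noisy value from a distribution) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `gaussian_noise` and `uniform_noise`, the corruption probability `p`, and the standard deviation `sigma` are assumptions introduced here. Each value is corrupted independently of the data with the same probability, which is only the simplest way the noise could appear.

```python
import random

def gaussian_noise(values, p, sigma, rng):
    """With probability p, replace x by a draw from a normal centered on x."""
    # First type of noise: a slight variation around the true value.
    return [rng.gauss(x, sigma) if rng.random() < p else x for x in values]

def uniform_noise(values, p, low, high, rng):
    """With probability p, replace x by a value drawn uniformly from [low, high]."""
    # Second type of noise: a completely random value from the attribute's range.
    return [rng.uniform(low, high) if rng.random() < p else x for x in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
slightly_off = gaussian_noise(clean, p=0.5, sigma=0.1, rng=rng)
fully_random = uniform_noise(clean, p=0.5, low=min(clean), high=max(clean), rng=rng)
print(slightly_off)
print(fully_random)
```

Keeping `p` separate from the distribution parameters mirrors the distinction made in the text between the probability that noise appears and the distribution of the noise values themselves.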


5.2.2 Simulating the Noise of Real-World Data Sets

Checking the effect of noisy data on the performance of classifier learning algorithms

is necessary to improve their reliability and has motivated the study of how to generate

and introduce noise into the data. Noise generation can be described by three main

characteristics [ 100 ]:

1. The place where the noise is introduced. Noise may affect the input attributes

or the output class, impairing the learning process and the resulting model.

2. The noise distribution. The way in which the noise is present can be, for example,

uniform [ 84 , 104 ] or Gaussian [ 100 , 102 ].

3. The magnitude of generated noise values. The extent to which the noise affects

the data set can be relative to each data value of each attribute, or relative to the

minimum, maximum and standard deviation for each attribute [ 100 , 102 , 104 ].
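
The third characteristic, the magnitude of the noise, can be illustrated with a short sketch. It contrasts a Gaussian perturbation whose spread scales with each individual value against one whose spread scales with the standard deviation of the whole attribute. The function names and the scaling factor `k` are hypothetical and chosen only for illustration; the cited studies may define magnitude differently.

```python
import random
import statistics

def noise_relative_to_value(column, k, rng):
    """Perturb each value with a spread proportional to that value's magnitude."""
    return [x + rng.gauss(0.0, k * abs(x)) for x in column]

def noise_relative_to_attribute(column, k, rng):
    """Perturb each value with a spread proportional to the attribute's standard deviation."""
    sd = statistics.pstdev(column)  # population standard deviation of the attribute
    return [x + rng.gauss(0.0, k * sd) for x in column]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
column = [10.0, 12.0, 9.0, 11.0, 100.0]
print(noise_relative_to_value(column, 0.05, rng))
print(noise_relative_to_attribute(column, 0.05, rng))
```

With the first scheme large values receive larger perturbations, while with the second every value in the attribute is perturbed on the same scale.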

In contrast to other studies in the literature, this work aims to clearly explain

how noise is defined and generated, and also to properly justify the choice of the

noise introduction schemes. Furthermore, the noise generation software has been

incorporated into the KEEL tool (see Chap. 10) for free use. The two types of

noise considered in this work, class and attribute noise, have been modeled using

four different noise schemes, so that the behavior of classifiers in the presence of

these types of noise can be simulated in these two scenarios:

1. Class noise usually occurs on the boundaries of the classes, where the examples

may have similar characteristics, although it can occur in any other area of

the domain. In this work, class noise is introduced using a uniform class noise

scheme [ 84 ] (randomly corrupting the class labels of the examples) and a pairwise

class noise scheme [ 100 , 102 ] (labeling examples of the majority class with the

second majority class). With these two schemes, noise affecting any class

label and noise affecting only the two majority classes are simulated, respectively.
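
The two class noise schemes described above can be sketched as follows. This is an illustrative reading of the description, not the KEEL implementation. It assumes that the noise level `rate` is a fraction of all examples in the uniform scheme and a fraction of the majority-class examples in the pairwise scheme, that a corrupted label is always replaced by a different class, and that the data set has at least two classes. All function and parameter names are hypothetical.

```python
import random
from collections import Counter

def uniform_class_noise(labels, rate, rng):
    """Give a fraction `rate` of the examples a different label chosen uniformly at random."""
    classes = sorted(set(labels))
    noisy = list(labels)
    n_corrupt = round(rate * len(labels))
    for i in rng.sample(range(len(labels)), n_corrupt):
        # Assumption: the new label is drawn from the other classes only.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy

def pairwise_class_noise(labels, rate, rng):
    """Relabel a fraction `rate` of the majority-class examples as the second majority class."""
    (majority, _), (second, _) = Counter(labels).most_common(2)
    noisy = list(labels)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    n_corrupt = round(rate * len(majority_idx))
    for i in rng.sample(majority_idx, n_corrupt):
        noisy[i] = second
    return noisy

labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
rng = random.Random(7)  # fixed seed so the sketch is repeatable
print(uniform_class_noise(labels, 0.2, rng))
print(pairwise_class_noise(labels, 0.5, rng))
```

In the uniform scheme any class label can be corrupted, whereas the pairwise scheme only moves examples from the majority class to the second majority class, matching the two situations named in the text.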
