Apart from the uniform class noise, NAR label noise has been widely studied in the literature. An example is the pairwise label noise, where examples of two selected classes are mislabeled as each other with a certain probability. In this pairwise label noise (or pairwise class noise), only two off-diagonal positions of the label transition matrix are nonzero. Another problem derived from the NAR noise model is that it is not trivial to decide whether the class labels are useful or not.
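As a concrete illustration of the pairwise noise model just described, the following sketch corrupts the labels of two chosen classes while leaving all others untouched; the function name and parameters are illustrative, not from any standard library.

```python
import numpy as np

def pairwise_label_noise(y, class_a, class_b, p, rng=None):
    """Flip labels between two chosen classes with probability p.

    Only the (a -> b) and (b -> a) entries of the implicit label
    transition matrix are nonzero off the diagonal, matching the
    pairwise (NAR) noise model: corruption depends on the label only.
    """
    rng = np.random.default_rng(rng)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < p          # one independent coin per example
    y_noisy[(y == class_a) & flip] = class_b
    y_noisy[(y == class_b) & flip] = class_a
    return y_noisy

# Usage: corrupt classes 0 and 1 with probability 0.2; class 2 is untouched.
y = np.array([0, 0, 1, 1, 2, 2, 0, 1])
y_noisy = pairwise_label_noise(y, class_a=0, class_b=1, p=0.2, rng=42)
```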
The third and last noise model is the noisy not at random (NNAR) model, where the input attributes somehow affect the probability of the class label being erroneous, as shown in Fig. 5.3c. An example of this is illustrated by Klebanov [49], where evidence is given that difficult samples are randomly labeled. It also occurs that examples similar to existing ones are labeled by experts in a biased way, having a higher probability of being mislabeled the more similar they are. The NNAR model is the most general case of class noise [59], where the error $E$ depends on both $X$ and $Y$, and it is the only model able to characterize mislabelings at the class borders or those due to poor sampling density. As shown in [19], the probability of error is much more complex than in the two previous cases, as it has to take into account the density function of the input over the input feature space $\mathcal{X}$ when continuous:
$$
p_n = P(E = 1) = \sum_{c_i \in C} \int_{x \in \mathcal{X}} P(X = x \mid Y = c_i)\, P(E = 1 \mid X = x, Y = c_i)\, dx \qquad (5.2)
$$
As a consequence, the perfect identification and estimation of NNAR noise is almost impossible, relying on approximating it from expert knowledge of the problem and the domain.
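A minimal sketch of NNAR class noise can make the dependence on $X$ concrete. Here, as an illustrative assumption (not taken from the text), the mislabeling probability decays exponentially with the distance to a simple class border, so examples near the border are flipped far more often; all names and the decay shape are hypothetical choices.

```python
import numpy as np

def nnar_class_noise(X, y, boundary=0.5, scale=0.1, max_p=0.4, rng=None):
    """Mislabel binary classes with a probability that grows near the
    class border, i.e. P(E=1 | X=x, Y=y) depends on x (NNAR model).

    The border is taken to be the hyperplane x[0] == boundary; `scale`
    controls how quickly the noise probability decays away from it.
    """
    rng = np.random.default_rng(rng)
    dist = np.abs(X[:, 0] - boundary)        # distance to the class border
    p_err = max_p * np.exp(-dist / scale)    # high noise near the border
    flip = rng.random(len(y)) < p_err
    return np.where(flip, 1 - y, y)

# Usage: points near x0 = 0.5 are mislabeled far more often than distant ones.
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = (X[:, 0] > 0.5).astype(int)
y_noisy = nnar_class_noise(X, y, rng=1)
```

This is exactly the behavior the model attributes to "difficult" borderline samples: the cleaner the separation from the border, the lower the chance of a flipped label.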
In the case of attribute noise, the noise models described above can be extended and adapted. Here we can also distinguish three possibilities:
•
When the noise appearance depends neither on the values of the rest of the input features nor on the class label, the NCAR noise model applies. This type of noise can occur when distortions in the measurements appear at random, for example in faulty manual data entry or network errors that do not depend on the data content itself.
•
When the attribute noise depends on the true value $x_i$ but not on the rest of the input values $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ or on the observed class label $y$, the NAR model is applicable. An illustrative example is when different temperatures affect their registration in climatic data in different ways depending on the temperature value itself.
•
In the last case, the noise probability depends on the value of the feature $x_i$ but also on the rest of the input feature values $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$. This is a very complex situation in which the value is altered when the rest of the features present a particular combination of values, as in medical diagnosis when some test results are filled in with an expert's prediction, without conducting the test, due to its high cost.
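The second case above, value-dependent attribute noise, can be sketched in a few lines. As a toy analogue of the temperature example, and under the purely illustrative assumption that the measurement error scale grows in proportion to the magnitude of the reading, the corruption depends on $x_i$ itself but on nothing else:

```python
import numpy as np

def nar_attribute_noise(x, rng=None):
    """Perturb one numeric attribute with value-dependent noise (NAR):
    the corruption depends on x_i itself but not on the other features
    or the class label. Larger readings get proportionally larger
    measurement errors (an assumed, illustrative noise shape).
    """
    rng = np.random.default_rng(rng)
    sigma = 0.05 * np.abs(x)     # noise scale grows with the true value
    return x + rng.normal(0.0, sigma)

# Usage: the 40-degree reading is perturbed far more than the 10-degree one.
temps = np.array([-5.0, 10.0, 25.0, 40.0])
noisy = nar_attribute_noise(temps, rng=0)
```

The NCAR variant would use a constant `sigma`, and an NNAR variant would let `sigma` depend on the remaining features as well.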
For the sake of brevity, we will not develop the probability error equations here, as their expressions would vary depending on the nature of each input feature $x_i$.