Dealing with Noisy Data - Data Preprocessing in Data Mining - page 109


Fig. 5.3 Statistical taxonomy of label noise as described in [19]. (a) Noisy completely at random (NCAR), (b) Noisy at random (NAR), and (c) Noisy not at random (NNAR). $X$ is the array of input attributes, $Y$ is the true class label, $\hat{Y}$ is the actual class label and $E$ indicates whether a labeling error occurred ($Y \neq \hat{Y}$). Arrows indicate statistical dependencies

in Sect. 4.2. That is, we will distinguish between three possible statistical models for label noise, as depicted in Fig. 5.3. In the three subfigures of Fig. 5.3 the dashed arrow marks the implicit relation between the input features and the output that the classifier is meant to model. In the simplest case, in which the noise process depends on neither the true class $Y$ nor the input attribute values $X$, the label noise is called noisy completely at random, or NCAR, as shown in Fig. 5.3a. In [7] the observed label differs from the true class with probability $p_n = P(E = 1)$, which is also called the error rate or noise rate. In binary classification problems, the labeling error in NCAR is applied symmetrically to both class labels, and when $p_n = 0.5$ the labels no longer provide useful information. In multiclass problems, when a labeling error occurs (i.e. $E = 1$) the class label is replaced by any of the other available labels. When the erroneous class label is selected from a uniform probability distribution, the noise model is known as uniform label/class noise.
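The NCAR mechanism with uniform class noise can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `add_uniform_label_noise` and its parameters are hypothetical and do not come from the cited works. It assumes each label is flipped independently with probability `noise_rate` (playing the role of $p_n$) and that the wrong label is drawn uniformly from the remaining classes.

```python
import random

def add_uniform_label_noise(labels, classes, noise_rate, seed=0):
    """NCAR sketch: corrupt each label with probability noise_rate,
    independently of the true class and of any input attributes."""
    rng = random.Random(seed)
    noisy, errors = [], []
    for y in labels:
        e = rng.random() < noise_rate  # e plays the role of E = 1
        if e:
            # Uniform class noise: pick any of the other classes with equal probability.
            y = rng.choice([c for c in classes if c != y])
        noisy.append(y)
        errors.append(e)
    return noisy, errors

# Hypothetical usage: corrupt roughly 20 % of the labels of a 3-class problem.
clean = [0, 1, 2] * 100
noisy, errors = add_uniform_label_noise(clean, classes=[0, 1, 2], noise_rate=0.2)
```

Because the flip decision uses neither the label nor any attribute values, the returned `errors` flags are independent of both, which is what distinguishes NCAR from the other two models.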

Things get more complicated in the noisy at random (NAR) model. Although the noise is independent of the inputs $X$, the true value of the class makes an example more or less prone to be mislabeled. This asymmetric labeling error can arise from differences in the cost of determining the true class, as for example in medical case-control studies, financial score assets and so on. Since the wrong class label depends on the particular true class label, the labeling probabilities can be defined as:

$$P(\hat{Y} = \hat{y} \mid Y = y) = \sum_{e \in \{0,1\}} P(\hat{Y} = \hat{y} \mid E = e, Y = y)\, P(E = e \mid Y = y). \tag{5.1}$$
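To make Eq. 5.1 concrete, the sum over $e \in \{0, 1\}$ can be evaluated directly. The sketch below rests on two assumptions that are not stated in the text: when no error occurs ($E = 0$) the observed label equals the true one, and when an error occurs ($E = 1$) the wrong label follows a user-supplied distribution. The function name, the class names and the error rates are all hypothetical.

```python
def observed_label_prob(y_hat, y, error_rate, wrong_label_dist):
    """P(Y_hat = y_hat | Y = y), summing over e in {0, 1} as in Eq. 5.1."""
    total = 0.0
    for e in (0, 1):
        # P(E = e | Y = y): class-dependent error rate (the NAR assumption).
        p_e = error_rate[y] if e == 1 else 1.0 - error_rate[y]
        if e == 0:
            # Assumption: without an error the label is kept unchanged.
            p_label = 1.0 if y_hat == y else 0.0
        else:
            # Assumption: with an error the label is drawn from wrong_label_dist[y].
            p_label = wrong_label_dist[y].get(y_hat, 0.0)
        total += p_label * p_e
    return total

# Hypothetical asymmetric setting: "sick" cases are mislabeled more often.
error_rate = {"healthy": 0.05, "sick": 0.30}
wrong_label_dist = {"healthy": {"sick": 1.0}, "sick": {"healthy": 1.0}}
p_missed = observed_label_prob("healthy", "sick", error_rate, wrong_label_dist)
```

Under these assumptions `p_missed` is simply the error rate of the "sick" class, since the $e = 0$ term contributes nothing when $\hat{y} \neq y$.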

Of course, this probability definition spans all the true class labels and all the possibly erroneous labels that an example could take. As shown in [70], this forms a transition matrix $\gamma$ in which each position $\gamma_{ij}$ gives the probability $P(\hat{Y} = c_i \mid Y = c_j)$ for the possible class labels $c_i$ and $c_j$. Some examples can be examined in detail in [19]. The NCAR model is a special case of the NAR label noise model in which the probability of each position $\gamma_{ij}$ denotes the independence between $\hat{Y}$ and $Y$: $\gamma_{ij} = P(\hat{Y} = c_i, Y = c_j)$.
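A transition matrix of this kind can be used directly to sample noisy labels. The sketch below follows the indexing of the text, with entry `gamma[i][j]` holding the probability of observing class $c_i$ when the true class is $c_j$, so each column must sum to one. The helper names are hypothetical, and the NCAR matrix assumes uniform class noise: $1 - p_n$ on the diagonal and $p_n / (K - 1)$ elsewhere for $K$ classes.

```python
import random

def ncar_transition_matrix(n_classes, noise_rate):
    """gamma[i][j] = P(Y_hat = c_i | Y = c_j) under uniform NCAR noise (assumed form)."""
    off_diagonal = noise_rate / (n_classes - 1)
    return [[1.0 - noise_rate if i == j else off_diagonal
             for j in range(n_classes)]
            for i in range(n_classes)]

def sample_noisy_label(true_idx, gamma, rng):
    """Draw an observed class index from column true_idx of gamma."""
    column = [row[true_idx] for row in gamma]
    return rng.choices(range(len(gamma)), weights=column, k=1)[0]

# A hypothetical NAR matrix for two classes: the columns differ, so the
# chance of a labeling error depends on the true class.
gamma_nar = [[0.95, 0.30],
             [0.05, 0.70]]

rng = random.Random(0)
gamma_ncar = ncar_transition_matrix(n_classes=3, noise_rate=0.3)
noisy = [sample_noisy_label(y, gamma_ncar, rng) for y in [0, 1, 2, 2, 1]]
```

In the NCAR matrix every column carries the same error mass, whereas a general NAR matrix such as `gamma_nar` lets each true class have its own error rate.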
