Whereas the former can be used to simulate an NCAR noise model, the latter is
useful to produce a particular NAR noise model.
2. Attribute noise can arise from several sources, such as transmission constraints, faults in sensor devices, irregularities in sampling, and transcription errors [85]. The erroneous attribute values can be totally unpredictable, i.e., random, or they can deviate only slightly from the correct value. We use the uniform attribute noise scheme [100, 104] and the Gaussian attribute noise scheme to simulate each of these possibilities, respectively (a sketch of both schemes is given after this list). We introduce attribute noise in accordance with the hypothesis that interactions between attributes are weak [100]; as a consequence, the noise introduced into each attribute has a low correlation with the noise introduced into the rest.
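As a concrete illustration, the following is a minimal sketch of the two attribute noise schemes, assuming a numeric feature matrix X stored as a NumPy array. The function names, the corruption of each attribute independently of the others, and the choice of a Gaussian standard deviation relative to each attribute's range (rel_sigma) are our own illustrative assumptions, not prescriptions from [100, 104].

```python
import numpy as np

def uniform_attribute_noise(X, noise_level, rng=None):
    """Replace a fraction of the values of each attribute with values drawn
    uniformly from that attribute's observed [min, max] range (random,
    totally unpredictable corruptions)."""
    rng = np.random.default_rng(rng)
    Xn = X.astype(float)
    n, d = Xn.shape
    for j in range(d):                     # attributes corrupted independently
        mask = rng.random(n) < noise_level
        lo, hi = X[:, j].min(), X[:, j].max()
        Xn[mask, j] = rng.uniform(lo, hi, mask.sum())
    return Xn

def gaussian_attribute_noise(X, noise_level, rel_sigma=0.1, rng=None):
    """Perturb a fraction of the values of each attribute with Gaussian noise
    centered on the correct value (a low variation with respect to it)."""
    rng = np.random.default_rng(rng)
    Xn = X.astype(float)
    n, d = Xn.shape
    for j in range(d):
        mask = rng.random(n) < noise_level
        sigma = rel_sigma * (X[:, j].max() - X[:, j].min())
        Xn[mask, j] += rng.normal(0.0, sigma, mask.sum())
    return Xn
```

The uniform scheme replaces a corrupted value with an arbitrary one from the attribute's domain, whereas the Gaussian scheme keeps it close to the original, matching the two possibilities described above.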
Robustness is the capability of an algorithm to build models that are insensitive to data corruptions and suffer less from the impact of noise [39]. Thus, a classification algorithm is said to be more robust than another if it builds classifiers that are less influenced by noise. In order to analyze the degree of robustness of the classifiers in the presence of noise, we will compare the performance of the classifiers learned from the original (without induced noise) data set with the performance of the classifiers learned from the noisy data set. Therefore, the classifiers learned from noisy data sets whose results are most similar to those of the noise-free classifiers will be the most robust ones.
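A common way to summarize this comparison in the noise literature is the relative loss of accuracy, which normalizes the accuracy drop by the accuracy obtained on the clean data; the measure below is a standard formulation, and the function name is ours:

```python
def relative_loss_of_accuracy(acc_clean, acc_noisy):
    """Relative loss of accuracy: 0 means no degradation under noise (fully
    robust); larger values mean the classifier suffers more from the noise."""
    return (acc_clean - acc_noisy) / acc_clean

# e.g., 90% accuracy on clean data vs. 81% after inducing noise:
# relative_loss_of_accuracy(0.90, 0.81) == 0.1, a 10% relative loss
```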
5.3 Noise Filtering at Data Level
Noise filters are preprocessing mechanisms that detect and eliminate noisy instances from the training set. The result of noise elimination in preprocessing is a reduced training set which is then used as input to a classification algorithm. Separating noise detection from learning has the advantage that noisy instances do not influence the design of the classifier [24].
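This separation can be sketched as follows, assuming NumPy arrays and scikit-learn-style estimators; the detection rule used here (cross-validated misclassification) is only a placeholder, since real filters such as those discussed below are more elaborate:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def misclassified_by_cv(X, y, cv=10):
    """Placeholder detection rule: flag instances that a cross-validated
    model misclassifies (real filters rely on ensembles and voting)."""
    preds = cross_val_predict(DecisionTreeClassifier(), X, y, cv=cv)
    return preds != y

def filter_then_train(X, y, detect_noisy=misclassified_by_cv):
    """Detect suspicious instances, drop them, and train the final
    classifier on the reduced training set only."""
    keep = ~detect_noisy(X, y)
    return DecisionTreeClassifier().fit(X[keep], y[keep])
```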
Noise filters are generally oriented to detect and eliminate instances with class noise from the training data. The elimination of such instances has been shown to be advantageous [23]. However, the elimination of instances with attribute noise seems counterproductive [74, 100], since instances with attribute noise still contain valuable information in their other attributes which can help to build the classifier. It is also hard to distinguish between noisy examples and true exceptions, and hence many techniques have been proposed to deal with noisy data sets, with different degrees of success.
We will consider three noise filters designed to deal with mislabeled instances, as they are the most common and the most recent: the Ensemble Filter [11], the Cross-Validated Committees Filter [89] and the Iterative-Partitioning Filter [48]. It should be noted that all three methods are ensemble-based, vote-based filters.
A motivation for using ensembles for filtering is pointed out in [11]: when it is assumed that some instances in the data have been mislabeled and that the label errors are independent of the particular model being fitted, collecting the predictions of several different classifiers provides a better estimate of which instances are mislabeled than relying on any single classifier.
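The following is a minimal sketch of such a voting filter in the spirit of the Ensemble Filter [11], assuming NumPy arrays and scikit-learn estimators; the particular choice of the three learning algorithms, the number of folds, and the voting parameterization are illustrative assumptions rather than the exact configuration of [11]:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def ensemble_filter(X, y, n_folds=10, scheme="majority", seed=0):
    """Voting filter: each instance is classified by several algorithms
    trained on the remaining folds, and the wrong predictions are treated
    as votes for the instance being noisy."""
    learners = [DecisionTreeClassifier(random_state=seed),
                KNeighborsClassifier(n_neighbors=3),
                GaussianNB()]
    errors = np.zeros((len(y), len(learners)), dtype=bool)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        for m, clf in enumerate(learners):
            clf.fit(X[train_idx], y[train_idx])
            errors[test_idx, m] = clf.predict(X[test_idx]) != y[test_idx]
    wrong_votes = errors.sum(axis=1)
    if scheme == "consensus":    # all members must misclassify the instance
        noisy = wrong_votes == len(learners)
    else:                        # majority voting: more than half disagree
        noisy = wrong_votes > len(learners) / 2
    return X[~noisy], y[~noisy]
```

Consensus voting is the more conservative option, removing fewer instances and thus lowering the risk of discarding true exceptions, while majority voting removes more of the mislabeled instances at the cost of some false positives.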