Noise is especially relevant in supervised problems, where it alters the relationship
between the informative features and the measured output. For this reason, noise has
been studied extensively in classification and regression, where it hinders knowledge
extraction from the data and spoils the models obtained from noisy data when they
are compared to the models learned from clean data of the same problem, which
represent the real implicit knowledge of the problem [100]. In this sense,
robustness [39] is the capability of an algorithm to build models that are insensitive
to data corruptions and suffer less from the impact of noise; that is, the more robust
an algorithm is, the more similar the models built from clean and noisy data are.
Thus, a classification algorithm is said to be more robust than another if the former
builds classifiers that are less influenced by noise than the latter. Robustness is
considered more important than raw performance when dealing with noisy data,
because it allows one to know a priori the expected behavior of a learning method
against noise when the characteristics of the noise are unknown.
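This comparison of models built from clean versus noisy data can be run empirically. The following is a minimal sketch using scikit-learn; the synthetic data set, the 20% uniform label noise, and the accuracy-drop measure are illustrative choices of ours, not taken from the cited works:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Inject uniform class noise: flip 20% of the training labels.
y_noisy = y_tr.copy()
flip = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# Train the same learner on the clean and on the corrupted training set.
acc_clean = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
acc_noisy = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy).score(X_te, y_te)

# The smaller this drop, the more robust the learner under this noise scheme.
print(f"accuracy drop under 20% label noise: {acc_clean - acc_noisy:.3f}")
```

Repeating this with several learners on the same corrupted data gives a simple empirical ranking of their robustness.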
Several approaches have been studied in the literature to deal with noisy data
and to obtain higher classification accuracies on test data. Among them, the most
important are:
Robust learners [8, 75] These are techniques characterized by being less influenced
by noisy data. An example of a robust learner is the C4.5 algorithm [75], which uses
pruning strategies to reduce the possibility that the tree overfits to noise in the
training data [74]. However, if the noise level is relatively high, even a robust
learner may perform poorly.
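The pruning effect can be illustrated with CART, since C4.5 itself is not available in scikit-learn; in this sketch the 20% noise level and the `ccp_alpha` value are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=800, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Corrupt 20% of the training labels.
flip = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_tr[flip] = 1 - y_tr[flip]

# An unpruned tree grows extra branches to fit the noisy labels;
# cost-complexity pruning removes most of them.
unpruned = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_tr, y_tr)
print(unpruned.tree_.node_count, "vs", pruned.tree_.node_count, "nodes")
```

The pruned tree is substantially smaller, because branches created solely to accommodate mislabeled instances do not reduce the cost-complexity criterion enough to survive.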
Data polishing methods [84] These aim to correct noisy instances prior to training
a learner. This option is viable only for small data sets, because it is generally
time consuming. Several works [84, 100] claim that complete or partial noise
correction in the training data, with the test data still containing noise, improves
test performance compared with no preprocessing.
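A minimal polishing heuristic can be sketched as follows; the relabeling rule used here (unanimous disagreement with the k nearest neighbours) is an illustrative stand-in, not the actual method of [84]:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def polish_labels(X, y, k=5):
    """Relabel an instance when all k nearest neighbours disagree with its label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    # Ask for k+1 neighbours and drop the first column (the point itself).
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    y_polished = y.copy()
    for i, idx in enumerate(neigh):
        votes = y[idx]
        if np.all(votes != y[i]):              # unanimous disagreement
            y_polished[i] = np.bincount(votes).argmax()
    return y_polished

# Two well-separated clusters with a single mislabeled instance.
rng = np.random.RandomState(2)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1                                       # inject one label error
y_fixed = polish_labels(X, y)
print(y_fixed[0])                              # the error is corrected back to 0
```

Requiring unanimous disagreement makes the rule conservative: borderline instances whose neighbourhood is mixed keep their original label, so correct labels are rarely overwritten.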
Noise filters [11, 48, 89] These identify noisy instances that can be eliminated from
the training data. They are used with learners that are sensitive to noisy data and
therefore require data preprocessing to address the problem.
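As a sketch, a simple classification filter can be built from cross-validated predictions; the choice of learner and the "misclassified implies noisy" rule are our simplification, not the exact filters of [11, 48, 89]:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def classification_filter(X, y, cv=5):
    """Drop instances that the cross-validated predictions misclassify."""
    pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    keep = pred == y
    return X[keep], y[keep]

# Two well-separated clusters with one mislabeled instance.
rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 0.1, (25, 2)), rng.normal(3, 0.1, (25, 2))])
y = np.array([0] * 25 + [1] * 25)
y[0] = 1                                  # inject one label error
X_f, y_f = classification_filter(X, y)
print(len(y), "->", len(y_f), "instances")
```

Unlike polishing, filtering discards suspect instances instead of correcting them, which is cheaper but reduces the amount of training data.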
Noise is not the only problem that supervised ML techniques have to deal with.
Complex and nonlinear boundaries between classes may also hinder the performance
of classifiers, and it is often hard to distinguish such class overlap from the
presence of noisy examples. This topic has attracted recent attention, with works
pointing out relevant issues related to the degradation of performance:
Presence of small disjuncts [41, 43] (Fig. 5.1a) The minority class can be decom-
posed into many sub-clusters, each containing very few examples and surrounded
by majority class examples. This is a source of difficulty for most learning
algorithms, which struggle to detect such sub-concepts precisely.
Overlapping between classes [26, 27] (Fig. 5.1b) There are often examples from
different classes with very similar characteristics, in particular when they are
located in the regions around the decision boundaries between classes. These
examples belong to overlapping regions of the classes.
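Both difficulties can be reproduced on synthetic data; in this sketch all `make_classification` parameters are illustrative choices: `n_clusters_per_class` splits each class into several sub-clusters, and a low `class_sep` pushes the classes together near the decision boundary.

```python
from collections import Counter
from sklearn.datasets import make_classification

# weights=[0.9, 0.1] makes class 1 the minority; with n_clusters_per_class=3
# its few examples are scattered over three sub-clusters (small disjuncts),
# and class_sep=0.5 leaves the clusters of both classes partly overlapping.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_clusters_per_class=3,
                           weights=[0.9, 0.1], class_sep=0.5, random_state=0)
print(Counter(y))
```

Because the minority class is both rare and fragmented, each sub-cluster behaves like a small disjunct, and the low separation adds overlapping examples that are easily confused with label noise.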