Noise is especially relevant in supervised problems, where it alters the relationship
between the informative features and the measured output. For this reason, noise has
been studied extensively in classification and regression, where it hinders knowledge
extraction from the data and spoils the models obtained from noisy data when they
are compared to the models learned from clean data of the same problem, which
represent the real implicit knowledge of the problem [100]. In this sense,
robustness [39] is the capability of an algorithm to build models that are insensitive
to data corruptions and suffer less from the impact of noise; that is, the more robust
an algorithm is, the more similar the models built from clean and noisy data are.
Thus, a classification algorithm is said to be more robust than another if the former
builds classifiers that are less influenced by noise than the latter. Robustness is
considered more important than raw performance when dealing with noisy data,
because it allows one to know a priori the expected behavior of a learning method
against noise when the characteristics of the noise are unknown.
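This comparison of models built from clean versus noisy data can be run empirically. The following is a minimal sketch using scikit-learn; the synthetic data set, the 20% uniform label noise, and the accuracy-drop measure are illustrative choices of ours, not taken from the cited works:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Inject uniform class noise: flip 20% of the training labels.
y_noisy = y_tr.copy()
flip = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# Train the same learner on the clean and on the corrupted training set.
acc_clean = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
acc_noisy = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy).score(X_te, y_te)

# The smaller this drop, the more robust the learner under this noise scheme.
print(f"accuracy drop under 20% label noise: {acc_clean - acc_noisy:.3f}")
```

Repeating this with several learners on the same corrupted data gives a simple empirical ranking of their robustness.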
Several approaches have been studied in the literature to deal with noisy data
and to obtain higher classification accuracies on test data. Among them, the most
important are:
Robust learners [8, 75] These are techniques characterized by being less influenced
by noisy data. An example of a robust learner is the C4.5 algorithm [75], which uses
pruning strategies to reduce the possibility that the tree overfits to noise in the
training data [74]. However, if the noise level is relatively high, even a robust
learner may perform poorly.
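The pruning effect can be illustrated with CART, since C4.5 itself is not available in scikit-learn; in this sketch the 20% noise level and the `ccp_alpha` value are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=800, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Corrupt 20% of the training labels.
flip = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_tr[flip] = 1 - y_tr[flip]

# An unpruned tree grows extra branches to fit the noisy labels;
# cost-complexity pruning removes most of them.
unpruned = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_tr, y_tr)
print(unpruned.tree_.node_count, "vs", pruned.tree_.node_count, "nodes")
```

The pruned tree is substantially smaller, because branches created solely to accommodate mislabeled instances do not reduce the cost-complexity criterion enough to survive.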
Data polishing methods [84] These aim to correct noisy instances prior to training
a learner. This option is viable only for small data sets, because it is generally
time consuming. Several works [84, 100] claim that complete or partial noise
correction in the training data, with the test data still containing noise, improves
test performance compared with no preprocessing.
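A minimal polishing heuristic can be sketched as follows; the relabeling rule used here (unanimous disagreement with the k nearest neighbours) is an illustrative stand-in, not the actual method of [84]:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def polish_labels(X, y, k=5):
    """Relabel an instance when all k nearest neighbours disagree with its label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    # Ask for k+1 neighbours and drop the first column (the point itself).
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    y_polished = y.copy()
    for i, idx in enumerate(neigh):
        votes = y[idx]
        if np.all(votes != y[i]):              # unanimous disagreement
            y_polished[i] = np.bincount(votes).argmax()
    return y_polished

# Two well-separated clusters with a single mislabeled instance.
rng = np.random.RandomState(2)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1                                       # inject one label error
y_fixed = polish_labels(X, y)
print(y_fixed[0])                              # the error is corrected back to 0
```

Requiring unanimous disagreement makes the rule conservative: borderline instances whose neighbourhood is mixed keep their original label, so correct labels are rarely overwritten.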
Noise filters [11, 48, 89] These identify noisy instances that can be eliminated from
the training data. They are used with learners that are sensitive to noisy data and
therefore require data preprocessing to address the problem.
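As a sketch, a simple classification filter can be built from cross-validated predictions; the choice of learner and the "misclassified implies noisy" rule are our simplification, not the exact filters of [11, 48, 89]:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def classification_filter(X, y, cv=5):
    """Drop instances that the cross-validated predictions misclassify."""
    pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    keep = pred == y
    return X[keep], y[keep]

# Two well-separated clusters with one mislabeled instance.
rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 0.1, (25, 2)), rng.normal(3, 0.1, (25, 2))])
y = np.array([0] * 25 + [1] * 25)
y[0] = 1                                  # inject one label error
X_f, y_f = classification_filter(X, y)
print(len(y), "->", len(y_f), "instances")
```

Unlike polishing, filtering discards suspect instances instead of correcting them, which is cheaper but reduces the amount of training data.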
Noise is not the only problem that supervised ML techniques have to deal with.
Complex and nonlinear boundaries between classes may also hinder the performance
of classifiers, and it is often hard to distinguish such class overlap from the
presence of noisy examples. This topic has attracted recent attention, with works
pointing out relevant issues related to the degradation of performance:
Presence of small disjuncts [41, 43] (Fig. 5.1a) The minority class can be decom-
posed into many sub-clusters, each containing very few examples and surrounded
by majority class examples. This is a source of difficulty for most learning
algorithms, which struggle to detect such sub-concepts precisely.
Overlapping between classes [26, 27] (Fig. 5.1b) There are often examples from
different classes with very similar characteristics, in particular when they are
located in the regions around the decision boundaries between classes. These
examples belong to overlapping regions of the classes.
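Both difficulties can be reproduced on synthetic data; in this sketch all `make_classification` parameters are illustrative choices: `n_clusters_per_class` splits each class into several sub-clusters, and a low `class_sep` pushes the classes together near the decision boundary.

```python
from collections import Counter
from sklearn.datasets import make_classification

# weights=[0.9, 0.1] makes class 1 the minority; with n_clusters_per_class=3
# its few examples are scattered over three sub-clusters (small disjuncts),
# and class_sep=0.5 leaves the clusters of both classes partly overlapping.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_clusters_per_class=3,
                           weights=[0.9, 0.1], class_sep=0.5, random_state=0)
print(Counter(y))
```

Because the minority class is both rare and fragmented, each sub-cluster behaves like a small disjunct, and the low separation adds overlapping examples that are easily confused with label noise.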