Chapter 5
Dealing with Noisy Data
Abstract This chapter focuses on the noise imperfections of the data. The presence of noise in data is a common problem that produces several negative consequences in classification problems. Noise is an unavoidable problem which affects the data collection and data preparation processes in Data Mining applications, where errors commonly occur. The performance of the models built under such circumstances will depend heavily not only on the quality of the training data, but also on the robustness of the learning algorithm itself against noise. Hence, problems containing noise are complex and accurate solutions are often difficult to achieve without specialized techniques, particularly when the learners are noise-sensitive. Identifying the noise is a complex task that is developed in Sect. 5.1. Once the noise has been identified, the different kinds of this imperfection are described in Sect. 5.2. From this point on, the two main approaches followed in the literature are described: on the one hand, modifying and cleaning the data is studied in Sect. 5.3, whereas designing noise-robust Machine Learning algorithms is tackled in Sect. 5.4. An empirical comparison between the latest approaches in the specialized literature is made in Sect. 5.5.
5.1 Identifying Noise
Real-world data is never perfect and often suffers from corruptions that may harm the interpretation of the data, the models built from it and the decisions made upon them. In classification, noise can negatively affect the system performance in terms of classification accuracy, building time, and the size and interpretability of the classifier built [99, 100]. The presence of noise in the data may also affect the intrinsic characteristics of a classification problem. Noise may create small clusters of instances of a particular class in parts of the instance space belonging to another class, remove instances located in key areas within a given class, or disrupt the boundaries of the classes and increase the overlapping among them. These alterations corrupt the knowledge that can be extracted from the problem and spoil the classifiers built from the noisy data with respect to the original classifiers built from the clean data, which represent the most accurate implicit knowledge of the problem [100].
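As a brief illustration of this effect (a minimal sketch, not taken from the chapter's experiments), the following code injects increasing levels of class noise into the training labels of a synthetic dataset and measures the test accuracy of a noise-sensitive learner. The dataset, noise rates and the choice of a decision tree as learner are illustrative assumptions.

# Sketch: effect of class noise on a noise-sensitive classifier.
# Synthetic data, noise rates and learner are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Clean synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

def add_class_noise(labels, rate, rng):
    """Flip the class label of a random fraction `rate` of the training instances."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    noisy[flip] = 1 - noisy[flip]   # binary problem: swap the two classes
    return noisy

# Only the training labels are corrupted; the test set stays clean,
# so the accuracy drop reflects the damage done to the induced model.
for rate in (0.0, 0.1, 0.2, 0.3):
    y_noisy = add_class_noise(y_train, rate, rng)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"class noise {rate:.0%}: test accuracy = {acc:.3f}")

In such a sketch the accuracy typically degrades as the noise rate grows, which is the behavior that motivates the data-cleaning and robust-learning approaches discussed in the following sections.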