Chapter 5
Dealing with Noisy Data
Abstract This chapter focuses on the noise imperfections of the data. The presence of noise in data is a common problem that produces several negative consequences in classification problems. Noise is an unavoidable problem which affects the data collection and data preparation processes in Data Mining applications, where errors commonly occur. The performance of the models built under such circumstances will depend heavily not only on the quality of the training data, but also on the robustness of the learning algorithm itself against noise. Hence, problems containing noise are complex and accurate solutions are often difficult to achieve without specialized techniques, particularly when the learners are noise-sensitive. Identifying the noise is a complex task that is developed in Sect. 5.1. Once the noise has been identified, the different kinds of this imperfection are described in Sect. 5.2. From this point on, the two main approaches followed in the literature are described: on the one hand, modifying and cleaning the data is studied in Sect. 5.3, whereas designing noise-robust Machine Learning algorithms is tackled in Sect. 5.4. An empirical comparison between the latest approaches in the specialized literature is made in Sect. 5.5.
5.1 Identifying Noise
Real-world data is never perfect and often suffers from corruptions that may harm the interpretation of the data, the models built from it and the decisions made upon them. In classification, noise can negatively affect the system performance in terms of classification accuracy, building time, and the size and interpretability of the classifier built [99, 100]. The presence of noise in the data may also affect the intrinsic characteristics of a classification problem. Noise may create small clusters of instances of a particular class in parts of the instance space belonging to another class, remove instances located in key areas within a given class, or disrupt the boundaries of the classes and increase the overlapping among them. These alterations corrupt the knowledge that can be extracted from the problem and spoil the classifiers built from the noisy data with respect to the original classifiers built from the clean data, which represent the most accurate implicit knowledge of the problem [100].
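As a brief illustration of this effect (a minimal sketch, not taken from the chapter's experiments), the following code injects increasing levels of class noise into the training labels of a synthetic dataset and measures the test accuracy of a noise-sensitive learner. The dataset, noise rates and the choice of a decision tree as learner are illustrative assumptions.

# Sketch: effect of class noise on a noise-sensitive classifier.
# Synthetic data, noise rates and learner are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Clean synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

def add_class_noise(labels, rate, rng):
    """Flip the class label of a random fraction `rate` of the training instances."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    noisy[flip] = 1 - noisy[flip]   # binary problem: swap the two classes
    return noisy

# Only the training labels are corrupted; the test set stays clean,
# so the accuracy drop reflects the damage done to the induced model.
for rate in (0.0, 0.1, 0.2, 0.3):
    y_noisy = add_class_noise(y_train, rate, rng)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"class noise {rate:.0%}: test accuracy = {acc:.3f}")

In such a sketch the accuracy typically degrades as the noise rate grows, which is the behavior that motivates the data-cleaning and robust-learning approaches discussed in the following sections.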