In this chapter, class noise refers to misclassifications, whereas attribute noise refers to erroneous attribute values, since these are the most common forms of noise in real-world data [100]. Furthermore, erroneous attribute values, unlike other types of attribute noise such as MVs (which are easily detectable), have received less attention in the literature.
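As a minimal illustration (not taken from the referenced works; the toy data and values are assumptions), the following Python sketch shows the distinction on a small data set: class noise corrupts a label, whereas attribute noise corrupts an attribute value or leaves it missing.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # attribute values of six examples
y = np.array([0, 0, 1, 1, 1, 0])   # their class labels

# Class noise: an example carries the wrong class label (misclassification).
y_noisy = y.copy()
y_noisy[2] = 0                     # example 2 is mislabeled

# Attribute noise: an attribute value is erroneous or missing (an MV).
X_noisy = X.copy()
X_noisy[4, 1] = 999.0              # erroneous attribute value
X_noisy[5, 0] = np.nan             # missing value, which is easy to detect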
Treating class and attribute noise as corruptions of the class labels and attribute values, respectively, has also been considered in other works in the literature [69, 100]. For instance, in [100] the authors reached a series of interesting conclusions, showing that attribute noise is more harmful than class noise, and that eliminating examples affected by class noise or correcting examples affected by attribute noise may improve classifier performance. They also showed that attribute noise is more harmful in those attributes highly correlated with the class labels. In [69], the authors checked the robustness of methods from different paradigms, such as probabilistic classifiers, decision trees, instance-based learners or SVMs, studying the possible causes of their behavior.
However, most of the works found in the literature focus only on class noise. In [9], the problem of multi-class classification in the presence of labeling errors was studied. The authors proposed a generative multi-class classifier that learns in the presence of labeling errors, extending multi-class quadratic normal discriminant analysis with a model of the mislabeling process. They demonstrated the benefits of this approach in terms of parameter recovery as well as improved classification performance. In [32], the problems caused by labeling errors occurring far from the decision boundaries in Multi-class Gaussian Process Classifiers were studied. The authors proposed a Robust Multi-class Gaussian Process Classifier, introducing binary latent variables that indicate whether an example is mislabeled. Similarly, the effect of mislabeled samples appearing in gene expression profiles was studied in [98]. A detection method for these samples was proposed, which takes advantage of an index measuring the effect of data perturbations based on the SVM regression model; the authors also proposed three algorithms based on this index to detect mislabeled samples. An important common characteristic of these works, also shared by this chapter, is that the suitability of the proposals was evaluated on both real-world and synthetic or noise-modified real-world data sets, where the noise could be quantified in some way.
In order to model class and attribute noise, we consider four different synthetic
noise schemes found in the literature, so that we can simulate the behavior of the
classifiers in the presence of noise as presented in the next section.
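The schemes themselves are presented in what follows; as an illustration of the general idea of injecting noise at a controlled level, the sketch below assumes a uniform (random) corruption scheme in which a given fraction of class labels is flipped to a different class and a given fraction of each attribute's values is replaced by a random value from the attribute's observed range. The function names and the NumPy-based representation are assumptions for illustration only, not the exact schemes used here.

import numpy as np

def uniform_class_noise(y, noise_level, seed=None):
    # Flip the labels of a noise_level fraction of examples to a
    # different class chosen at random.
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    classes = np.unique(y)
    idx = rng.choice(len(y), size=int(round(noise_level * len(y))), replace=False)
    for i in idx:
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    return y_noisy

def uniform_attribute_noise(X, noise_level, seed=None):
    # Replace a noise_level fraction of each attribute's values with
    # values drawn uniformly from that attribute's observed range.
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    n = int(round(noise_level * X.shape[0]))
    for j in range(X.shape[1]):
        idx = rng.choice(X.shape[0], size=n, replace=False)
        lo, hi = X[:, j].min(), X[:, j].max()
        X_noisy[idx, j] = rng.uniform(lo, hi, size=n)
    return X_noisy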
5.2.1 Noise Introduction Mechanisms
Traditionally, the mechanism by which label noise is introduced has received less attention than its consequences for the knowledge extracted from the data. However, as noise treatment becomes embedded in classifier design, the nature of the noise becomes more and more important. Recently, Frenay and Verleysen [19] have adopted the statistical analysis used to describe the introduction of MVs in order to characterize the mechanisms that introduce label noise.
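A rough sketch of the resulting distinction follows, assuming binary labels and a NumPy representation (the function names and the choice of the first attribute as the driver of the noise are illustrative only): noise that is completely at random flips every label with the same probability, regardless of the example, whereas a data-dependent mechanism flips labels with a probability driven by the attribute values.

import numpy as np

def flip_completely_at_random(y, p, seed=None):
    # Every label is flipped with the same probability p, independently of
    # the true class and of the attribute values.
    rng = np.random.default_rng(seed)
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)          # binary labels assumed

def flip_depending_on_attributes(X, y, p_max, seed=None):
    # The flipping probability grows with the first attribute, so the noise
    # mechanism depends on the observed data.
    rng = np.random.default_rng(seed)
    x0 = X[:, 0]
    p = p_max * (x0 - x0.min()) / (x0.max() - x0.min() + 1e-12)
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)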
 