Because the errors made by different classifiers are independent of the particular model being fitted to the data, collecting information from different models will provide a better method for detecting mislabeled instances than collecting information from a single model.
The implementations of these three noise filters can be found in KEEL (see Chap. 10). Their descriptions can be found in the following subsections. In all descriptions we use D_T to refer to the training set, D_N to refer to the noisy data identified in the training set (initially, D_N = ∅), and Γ to refer to the number of folds into which the training data is partitioned by the noise filter.
The three noise filters presented below use a voting scheme to determine which instances to eliminate from the training set. There are two possible schemes: consensus and majority. The consensus scheme removes an instance only if it is misclassified by all the classifiers, while the majority scheme removes an instance if it is misclassified by more than half of the classifiers. Consensus filters are therefore conservative: they rarely reject good data, at the expense of retaining more bad data. Majority filters are better at detecting bad data, at the expense of rejecting more good data.
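To make the distinction concrete, the following minimal Python sketch shows how each scheme decides whether an instance is removed. The function names and the boolean vote representation are our own illustrative assumptions, not part of KEEL:

    # One boolean vote per classifier: True means "this classifier
    # misclassified the instance".

    def consensus_vote(votes):
        # Consensus: remove only if *all* classifiers misclassify the instance.
        return all(votes)

    def majority_vote(votes):
        # Majority: remove if *more than half* of the classifiers misclassify it.
        return sum(votes) > len(votes) / 2

    votes = [True, True, False]       # 2 of 3 classifiers err on this instance
    print(consensus_vote(votes))      # False: the consensus filter keeps it
    print(majority_vote(votes))       # True: the majority filter removes it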
5.3.1 Ensemble Filter
The Ensemble Filter (EF) [11] is a well-known filter in the literature. It attempts to improve the quality of the training data, as a preprocessing step in classification, by detecting and eliminating mislabeled instances. It uses a set of learning algorithms to build classifiers on several subsets of the training data; these classifiers then serve as noise filters for the training set.
The identification of potentially noisy instances is carried out by performing a Γ-FCV on the training data with μ classification algorithms, called filter algorithms. In the experimentation developed for this topic we have utilized the three filter algorithms used by the authors in [11], which are C4.5, 1-NN and LDA [63]. The complete process carried out by EF is described below (a code sketch of the full procedure is given after the list):
• Split the training data set D_T into Γ equal-sized subsets.
• For each one of the μ filter algorithms:
  - For each of these Γ parts, the filter algorithm is trained on the other Γ − 1 parts. This results in Γ different classifiers.
  - These Γ resulting classifiers are then used to tag each instance in the excluded part as either correct or mislabeled, by comparing the training label with that assigned by the classifier.
• At the end of the above process, each instance in the training data has been tagged by each filter algorithm.
• Add to D_N the noisy instances identified in D_T using a voting scheme, taking into account the correctness of the labels obtained in the previous step by the μ filter algorithms. We use a consensus vote scheme in this case.
• Remove the noisy instances from the training set: D_T ← D_T \ D_N.
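Putting the steps together, a minimal Python sketch of the EF procedure is shown below, using scikit-learn models as stand-ins for the three filter algorithms in [11]: DecisionTreeClassifier approximates C4.5, KNeighborsClassifier with n_neighbors=1 is 1-NN, and LinearDiscriminantAnalysis is LDA. The function name, the default Γ = 3, and the boolean mask representation of D_N are our own illustrative choices, not KEEL's implementation:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def ensemble_filter(X, y, n_folds=3):
        """Return a boolean mask marking the instances identified as noisy (D_N)."""
        filter_algorithms = [
            DecisionTreeClassifier(),             # stand-in for C4.5
            KNeighborsClassifier(n_neighbors=1),  # 1-NN
            LinearDiscriminantAnalysis(),         # LDA
        ]
        # mislabeled[i, j] is True if filter algorithm j tags instance i as noisy.
        mislabeled = np.zeros((len(y), len(filter_algorithms)), dtype=bool)
        folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
        for j, clf in enumerate(filter_algorithms):
            for train_idx, test_idx in folds.split(X):
                clf.fit(X[train_idx], y[train_idx])    # train on the other Γ − 1 parts
                pred = clf.predict(X[test_idx])        # tag the excluded part
                mislabeled[test_idx, j] = pred != y[test_idx]
        # Consensus vote: an instance joins D_N only if every algorithm tags it.
        return mislabeled.all(axis=1)

    # Usage: keep D_T \ D_N.
    # noisy = ensemble_filter(X, y)
    # X_clean, y_clean = X[~noisy], y[~noisy]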
 