5.3.2 Cross-Validated Committees Filter
The Cross-Validated Committees Filter (CVCF) [89] uses ensemble methods to preprocess the training set, identifying and removing mislabeled instances in classification data sets. CVCF is mainly based on performing an n-FCV (n-fold cross-validation) to split the full training data and on building classifiers using decision trees in each training subset. The authors of CVCF place special emphasis on using ensembles of decision trees such as C4.5 because they consider that this kind of algorithm works well as a filter for noisy data.
The basic steps of CVCF are the following:
• Split the training data set D_T into n equal sized subsets.
• For each of these n parts, a base learning algorithm is trained on the other n − 1 parts. This results in n different classifiers. We use C4.5 as the base learning algorithm in our experimentation, as recommended by the authors.
• These n resulting classifiers are then used to tag each instance in the training set D_T as either correct or mislabeled, by comparing the training label with that assigned by the classifier.
• Add to D_N the noisy instances identified in D_T using a voting scheme (the majority scheme in our experimentation), taking into account the correctness of the labels obtained in the previous step by the n classifiers built.
• Remove the noisy instances from the training set: D_T ← D_T \ D_N. A sketch of the whole procedure is given after this list.
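To make the steps concrete, the following Python sketch approximates CVCF with scikit-learn. It is an illustration under stated assumptions, not the reference implementation: DecisionTreeClassifier stands in for C4.5, and the function name, the fold count, and the majority threshold are our own choices.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cvcf_filter(X, y, n_folds=5, random_state=0):
    """CVCF sketch: flag an instance as noisy when a majority of the
    n fold classifiers disagree with its training label.
    X, y are numpy arrays."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    votes = np.zeros(len(y), dtype=int)  # "mislabeled" votes per instance
    for train_idx, _ in kf.split(X):
        # Train on the other n - 1 parts; the tree stands in for C4.5.
        clf = DecisionTreeClassifier(random_state=random_state)
        clf.fit(X[train_idx], y[train_idx])
        # Each classifier tags EVERY training instance, not just its held-out fold.
        votes += (clf.predict(X) != y).astype(int)
    # Majority voting scheme: more than half of the n classifiers must disagree.
    noisy = votes > n_folds / 2
    # D_T <- D_T \ D_N; also return the indices of D_N for inspection.
    return X[~noisy], y[~noisy], np.where(noisy)[0]
```

With n_folds = 5, an instance is removed only when at least three of the five classifiers assign it a label different from its training label; a stricter consensus scheme would require all five to disagree.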
5.3.3 Iterative-Partitioning Filter
The Iterative-Partitioning Filter (IPF) [48] is a preprocessing technique based on the Partitioning Filter [102]. It is employed to identify and eliminate mislabeled instances in large data sets. Most noise filters assume that the data set is relatively small and can be processed by the learning algorithm in a single pass, but this is not always true, and partitioning procedures may be necessary.
IPF removes noisy instances in multiple iterations until a stopping criterion is reached. The iterative process stops if, for a number of consecutive iterations s, the number of identified noisy instances in each of these iterations is less than a percentage p of the size of the original training data set. Initially, we have a set of noisy instances D_N = ∅ and a set of good data D_G = ∅. The basic steps of each iteration are:
• Split the training data set D_T into n equal sized subsets. Each of these is small enough to be processed by an induction algorithm once.
• For each of these n parts, a base learning algorithm is trained on this part. This results in n different classifiers. We use C4.5 as the base learning algorithm in our experimentation, as recommended by the authors (see the sketch after this list).
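The iterative loop and its stopping criterion can be sketched in Python as follows. This is an illustration under stated assumptions: DecisionTreeClassifier again stands in for C4.5, the tagging and removal steps are assumed to use a majority voting scheme analogous to CVCF, and all names and parameter defaults (n_parts, s, p) are our own. For example, with p = 0.01 and s = 3, filtering stops once fewer than 1% of the original instances are flagged in three consecutive iterations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ipf_filter(X, y, n_parts=5, s=3, p=0.01, random_state=0):
    """IPF sketch: iteratively remove noisy instances until, for s
    consecutive iterations, fewer than p * |original D_T| instances
    are flagged. p is a fraction of the original training set size."""
    rng = np.random.default_rng(random_state)
    X_cur, y_cur = X.copy(), y.copy()
    stop_threshold = p * len(y)   # computed from the ORIGINAL set size
    below_count = 0               # consecutive "quiet" iterations so far
    while below_count < s:
        # Split the current data into n roughly equal sized parts.
        order = rng.permutation(len(y_cur))
        parts = np.array_split(order, n_parts)
        votes = np.zeros(len(y_cur), dtype=int)
        for part in parts:
            # Unlike CVCF, each classifier is trained on ONE part only.
            clf = DecisionTreeClassifier(random_state=random_state)
            clf.fit(X_cur[part], y_cur[part])
            votes += (clf.predict(X_cur) != y_cur).astype(int)
        # Assumed majority scheme, analogous to CVCF.
        noisy = votes > n_parts / 2
        below_count = below_count + 1 if noisy.sum() < stop_threshold else 0
        # Remove the flagged instances before the next iteration.
        X_cur, y_cur = X_cur[~noisy], y_cur[~noisy]
    return X_cur, y_cur
```

Training each classifier on a single part keeps every model small enough for one pass over its data, which is the point of the partitioning scheme for large data sets.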