5.3.2 Cross-Validated Committees Filter
The Cross-Validated Committees Filter (CVCF) [89] uses ensemble methods to preprocess the training set, identifying and removing mislabeled instances in classification data sets. CVCF is mainly based on performing an n-FCV (n-fold cross-validation) to split the full training data and on building classifiers using decision trees in each training subset. The authors of CVCF place special emphasis on using ensembles of decision trees such as C4.5 because they consider that this kind of algorithm works well as a filter for noisy data.
The basic steps of CVCF are the following:
• Split the training data set D_T into n equal sized subsets.
• For each of these n parts, a base learning algorithm is trained on the other n − 1 parts. This results in n different classifiers. We use C4.5 as the base learning algorithm in our experimentation, as recommended by the authors.
• These n resulting classifiers are then used to tag each instance in the training set D_T as either correct or mislabeled, by comparing the training label with that assigned by the classifier.
• Add to D_N the noisy instances identified in D_T using a voting scheme (the majority scheme in our experimentation), taking into account the correctness of the labels obtained in the previous step by the n classifiers built.
• Remove the noisy instances from the training set: D_T ← D_T \ D_N. A sketch of the whole procedure is given after this list.
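To make the steps concrete, the following Python sketch approximates CVCF with scikit-learn. It is an illustration under stated assumptions, not the reference implementation: DecisionTreeClassifier stands in for C4.5, and the function name, the fold count, and the majority threshold are our own choices.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cvcf_filter(X, y, n_folds=5, random_state=0):
    """CVCF sketch: flag an instance as noisy when a majority of the
    n fold classifiers disagree with its training label.
    X, y are numpy arrays."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    votes = np.zeros(len(y), dtype=int)  # "mislabeled" votes per instance
    for train_idx, _ in kf.split(X):
        # Train on the other n - 1 parts; the tree stands in for C4.5.
        clf = DecisionTreeClassifier(random_state=random_state)
        clf.fit(X[train_idx], y[train_idx])
        # Each classifier tags EVERY training instance, not just its held-out fold.
        votes += (clf.predict(X) != y).astype(int)
    # Majority voting scheme: more than half of the n classifiers must disagree.
    noisy = votes > n_folds / 2
    # D_T <- D_T \ D_N; also return the indices of D_N for inspection.
    return X[~noisy], y[~noisy], np.where(noisy)[0]
```

With n_folds = 5, an instance is removed only when at least three of the five classifiers assign it a label different from its training label; a stricter consensus scheme would require all five to disagree.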
5.3.3 Iterative-Partitioning Filter
The Iterative-Partitioning Filter (IPF) [48] is a preprocessing technique based on the Partitioning Filter [102]. It is employed to identify and eliminate mislabeled instances in large data sets. Most noise filters assume that the data set is relatively small and can be processed by the learning algorithm in a single pass, but this is not always true, and partitioning procedures may be necessary.
IPF removes noisy instances in multiple iterations until a stopping criterion is reached. The iterative process stops if, for a number of consecutive iterations s, the number of identified noisy instances in each of these iterations is less than a percentage p of the size of the original training data set. Initially, we have a set of noisy instances D_N = ∅ and a set of good data D_G = ∅. The basic steps of each iteration are:
• Split the training data set D_T into n equal sized subsets. Each of these is small enough to be processed by an induction algorithm once.
• For each of these n parts, a base learning algorithm is trained on this part. This results in n different classifiers. We use C4.5 as the base learning algorithm in our experimentation, as recommended by the authors (see the sketch after this list).
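The iterative loop and its stopping criterion can be sketched in Python as follows. This is an illustration under stated assumptions: DecisionTreeClassifier again stands in for C4.5, the tagging and removal steps are assumed to use a majority voting scheme analogous to CVCF, and all names and parameter defaults (n_parts, s, p) are our own. For example, with p = 0.01 and s = 3, filtering stops once fewer than 1% of the original instances are flagged in three consecutive iterations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ipf_filter(X, y, n_parts=5, s=3, p=0.01, random_state=0):
    """IPF sketch: iteratively remove noisy instances until, for s
    consecutive iterations, fewer than p * |original D_T| instances
    are flagged. p is a fraction of the original training set size."""
    rng = np.random.default_rng(random_state)
    X_cur, y_cur = X.copy(), y.copy()
    stop_threshold = p * len(y)   # computed from the ORIGINAL set size
    below_count = 0               # consecutive "quiet" iterations so far
    while below_count < s:
        # Split the current data into n roughly equal sized parts.
        order = rng.permutation(len(y_cur))
        parts = np.array_split(order, n_parts)
        votes = np.zeros(len(y_cur), dtype=int)
        for part in parts:
            # Unlike CVCF, each classifier is trained on ONE part only.
            clf = DecisionTreeClassifier(random_state=random_state)
            clf.fit(X_cur[part], y_cur[part])
            votes += (clf.predict(X_cur) != y_cur).astype(int)
        # Assumed majority scheme, analogous to CVCF.
        noisy = votes > n_parts / 2
        below_count = below_count + 1 if noisy.sum() < stop_threshold else 0
        # Remove the flagged instances before the next iteration.
        X_cur, y_cur = X_cur[~noisy], y_cur[~noisy]
    return X_cur, y_cur
```

Training each classifier on a single part keeps every model small enough for one pass over its data, which is the point of the partitioning scheme for large data sets.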