According to these criteria, we propose the following optimization model:
\[
\text{Minimize } ER(x), \qquad \text{Minimize } |x| \tag{7}
\]
Note that the objectives in the optimization model (7) are contradictory, since a lower number of significant cases means a higher error rate and vice versa; that is, the greater the number of selected cases, the smaller the error rate. The solution to model (7) is a set of $X$ non-dominated solutions $C = \{x^k\}$, $k \in \{1, \dots, X\}$, where each solution $x^k$ of $C$ represents the best collection of $k$ significant cases.
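To make the encoding concrete, the following sketch evaluates the two objectives of model (7) for a candidate solution represented as a binary mask over the case memory. This is a minimal illustration, not the paper's implementation: the function names, the mask encoding, and the leave-one-out estimate of $ER(x)$ are all assumptions.

```python
import numpy as np

def error_rate(mask, X, y):
    """Estimate ER(x): leave-one-out 1-NN error when only the cases
    selected by `mask` may act as neighbours (an assumed estimator)."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 1.0                      # an empty case memory classifies nothing
    errors = 0
    for i in range(len(X)):
        cand = selected[selected != i]  # selected cases other than the query
        if cand.size == 0:
            errors += 1
            continue
        d = np.linalg.norm(X[cand] - X[i], axis=1)
        errors += int(y[cand[np.argmin(d)]] != y[i])
    return errors / len(X)

def objectives(mask, X, y):
    """The two objectives of model (7): (ER(x), |x|)."""
    return error_rate(mask, X, y), int(mask.sum())
```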
From a practical point of view, and in order to simplify the model, it is worthwhile to sacrifice a small amount of accuracy when the number of cases can be reduced significantly. Some examples are provided in Section 4.
We propose the NSGA-II [6] and SPEA-2 [33] multiobjective evolutionary algorithms to solve the problem.
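As one possible realisation (the authors do not detail their implementation here), a binary-coded NSGA-II can be assembled with the DEAP library; the population size, operator rates, and the stand-in evaluation function below are illustrative assumptions.

```python
import random
from deap import algorithms, base, creator, tools

N_CASES, POP, GENS = 150, 100, 50   # illustrative sizes

creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))  # minimise ER(x) and |x|
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.bit, n=N_CASES)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=1.0 / N_CASES)
toolbox.register("select", tools.selNSGA2)  # NSGA-II environmental selection

def evaluate(ind):
    # Stand-in for (ER(x), |x|) that still shows the trade-off of model (7);
    # in practice this would call an error estimator such as objectives() above.
    size = sum(ind)
    return 1.0 / (1 + size), float(size)

toolbox.register("evaluate", evaluate)

pop = toolbox.population(n=POP)
for ind in pop:
    ind.fitness.values = toolbox.evaluate(ind)
for _ in range(GENS):
    offspring = algorithms.varAnd(pop, toolbox, cxpb=0.9, mutpb=0.1)
    for ind in offspring:
        ind.fitness.values = toolbox.evaluate(ind)
    pop = toolbox.select(pop + offspring, k=POP)  # (mu + lambda) survival
# `pop` now approximates the non-dominated set C of model (7)
```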
4 Experiments and Results
In this section, we present a practical application of the proposed methodology. We evaluate the case selection methods described in Section 3.3 using case memories from different domains. In particular, we consider standard datasets from the UCI repository (http://archive.ics.uci.edu/ml/). Following this methodology, we set f = 10 (that is, cross-validation with 10 folds) and K = 1 (i.e. a 1-NN classifier).
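For reference, this evaluation protocol can be reproduced with off-the-shelf tools; the scikit-learn calls below are an illustration of the setup, not the code used in the experiments.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)           # one of the UCI datasets considered
clf = KNeighborsClassifier(n_neighbors=1)   # K = 1, i.e. a 1-NN classifier
scores = cross_val_score(clf, X, y, cv=10)  # f = 10 folds
print(f"10-fold CV error rate: {1 - scores.mean():.3f}")
```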
Table 1 summarises the experiments. For each case memory (rows), the best results are highlighted in boldface and the worst in italics.
In general, CNN and RNN achieve large size reductions on noise-free case memories; however, their error remains higher than that of the control methods in every case. In all the experiments, ENN and All-KNN maintain or improve the error rate. If the case memory has no noisy instances the reduction is negligible; otherwise the reduction is clearly significant. Methods such as IB2, IB3 and Shrink achieve large size reductions when they select instances from a case memory with well-defined boundaries, but they are too sensitive to the presence of noisy instances.
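As a reminder of how the editing methods behave, the sketch below implements Wilson's ENN rule, which removes every case misclassified by the majority vote of its k nearest neighbours; the brute-force distance computation and the choice k = 3 are illustrative.

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbour (Wilson editing): keep a case only if the
    majority vote of its k nearest neighbours agrees with its label."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the case itself
        votes = y[np.argsort(d)[:k]]
        labels, counts = np.unique(votes, return_counts=True)
        if labels[np.argmax(counts)] != y[i]:
            keep[i] = False                  # misclassified, so edit it out
    return X[keep], y[keep]
```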
Evaluating the methods on each dataset highlights the suitability of some of them. For small datasets with well-defined boundaries (Iris and Wine), the SPEA-2 algorithm seems to be the best approach, since it reduces the case memory by about 50% while maintaining an acceptable error rate. For larger datasets with no clear boundaries (the Yeast or Breast Cancer datasets), the DROP algorithms also achieve an effective reduction of the memory (approx. 80%); however, ENN and RENN reach a solid reduction while also minimising the error rate, at a lower time cost. Note that in the medical domain (e.g. the Breast Cancer dataset) other aspects must be taken into account. In this sense, NSGA-II seems to be the most effective algorithm, since it reduces the case memory by about 50% while maintaining the error rate as well as the kappa coefficient, specificity and sensitivity. According to the experiments, ENN and RENN also seem useful for large datasets with a high number of classes (such as Abalone), improving the accuracy of the system while reducing the case memory by 80%.
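Since the medical-domain comparison relies on the kappa coefficient, specificity and sensitivity, the fragment below shows one common way to obtain them for a binary problem; the arrays and the use of scikit-learn are illustrative assumptions.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical gold labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # hypothetical 1-NN predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true-positive rate
specificity = tn / (tn + fp)        # true-negative rate
kappa = cohen_kappa_score(y_true, y_pred)
print(sensitivity, specificity, kappa)
```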
 