However, from a practical point of view, the NN algorithm is not suited to very
large data sets because of the storage requirements it imposes and the
computational overhead involved, since this approach requires all the instances
to be kept in memory. Pioneering work in IS first tried to reduce the size of the
stored set. Next, we give a brief historical review of this topic. The algorithms used
in this study will be described in Section 5.3.
Hart [17] proposes a condensed NN rule to find a consistent subset, which
correctly classifies all of the remaining points in the sample set. However, this
algorithm will not find a minimal consistent subset [39].
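As a rough illustration of this idea, the sketch below (a Python sketch using a plain 1-NN rule and illustrative names, not Hart's original pseudocode) grows a consistent subset by absorbing every instance that the instances kept so far misclassify:

import numpy as np

def condensed_nn(X, y):
    # Grow a subset 'store' such that 1-NN restricted to X[store]
    # classifies every training instance correctly (Hart-style condensation).
    store = [0]                        # start from an arbitrary instance
    changed = True
    while changed:                     # repeat passes until nothing is added
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(dists))]
            if y[nearest] != y[i]:     # misclassified -> absorb the instance
                store.append(i)
                changed = True
    return store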
The reduced NN rule proposed by Gates [15] tries to overcome this drawback by
searching within Hart's consistent set for the minimal subset that still correctly
classifies all the learning instances. However, this approach succeeds if and only if
Hart's consistent set contains the minimal consistent subset of the learning set,
which is not always the case. It is worth remarking that both of these approaches
produce a single instance subset: the user cannot retain more or fewer instances or
otherwise control the size of the subset.
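A sketch of this contraction step, under the same illustrative assumptions, is shown below: starting from a condensed set, each instance is tentatively removed, and the removal is kept only when the remaining subset still classifies every learning instance correctly.

import numpy as np

def consistent(X, y, subset):
    # True if 1-NN restricted to X[subset] classifies every instance of X correctly.
    for i in range(len(X)):
        dists = np.linalg.norm(X[subset] - X[i], axis=1)
        if y[subset[int(np.argmin(dists))]] != y[i]:
            return False
    return True

def reduced_nn(X, y, condensed_subset):
    subset = list(condensed_subset)
    for idx in list(subset):                      # try to drop each instance in turn
        trial = [j for j in subset if j != idx]
        if trial and consistent(X, y, trial):     # keep the removal if consistency holds
            subset = trial
    return subset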
Wilson [38] proposes edited NN and repeated edited NN. Edited NN removes
noisy instances as well as close border cases, leaving smoother decision boundaries.
However, it retains all internal points, so it does not reduce the storage
requirements as much as most other reduction algorithms. Repeated edited NN
continues to widen the gap between classes and to smooth the decision boundary.
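A minimal sketch of the editing step is given below, assuming k = 3 and Euclidean distance (illustrative choices only): an instance is discarded when the majority of its k nearest neighbors disagree with its class label. Repeated edited NN simply re-applies this step to the surviving instances until no further instance is removed.

from collections import Counter
import numpy as np

def edited_nn(X, y, k=3):
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]          # skip the point itself
        majority = Counter(y[neighbours]).most_common(1)[0][0]
        if majority == y[i]:                             # neighbours agree -> keep it
            keep.append(i)
    return keep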
Kibler and Aha [20] propose the shrink algorithm, which is similar to the
reduced NN. It retains border points but, unlike the reduced NN, it is sensitive
to noise.
Brodley [7] proposes the MCS (model class selection) system to deal with the
IS problem; MCS tends to avoid retaining noisy instances.
Wilson and Martinez [39] present a family of editing algorithms guided by two
sets defined for each instance: its k nearest neighbors and its associates. An
associate of an instance P is any instance that has P among its k nearest
neighbors. The family is composed of five algorithms, DROP1 to DROP5. The
first three methods, DROP1-3, were previously introduced by the authors under
the names RT1-3, respectively. These algorithms aim to produce instance
reductions that provide noise tolerance, high generalization accuracy,
insensitivity to the order of presentation of instances, and significant storage
reduction, which in turn improves the generalization and speed.
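To make the two sets concrete, the sketch below shows one way the neighbor and associate lists could be built (Euclidean distance and k = 3 are assumed for illustration); the closing comment paraphrases the flavor of the DROP removal criterion rather than reproducing any particular variant.

import numpy as np

def build_neighbours_and_associates(X, k=3):
    n = len(X)
    neighbours = []
    associates = [set() for _ in range(n)]
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        knn = np.argsort(dists)[1:k + 1]     # k nearest neighbours of instance i
        neighbours.append(knn)
        for q in knn:
            associates[q].add(i)             # i is an associate of each of its neighbours
    return neighbours, associates

# DROP-style rule (informally): an instance may be removed when at least as many
# of its associates are classified correctly without it as with it.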
5.2.2 Strategies for Instance Selection
In this section, we describe two strategies for IS followed in the study presented in
this chapter.