However, from a practical point of view, the NN algorithm is not suited to very
large data sets because of the storage requirements it imposes and the
computational overhead involved, since this approach requires all the instances
to be kept in memory. Pioneering work in IS first tried to reduce the size of the
stored set. Next, we give a brief historical review of this topic. The algorithms used
in this study will be described in Section 5.3.
Hart [17] proposes a condensed NN rule to find a consistent subset, which
correctly classifies all of the remaining points in the sample set. However, this
algorithm will not find a minimal consistent subset [39].
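As a rough illustration of this idea, the sketch below (a Python sketch using a plain 1-NN rule and illustrative names, not Hart's original pseudocode) grows a consistent subset by absorbing every instance that the instances kept so far misclassify:

import numpy as np

def condensed_nn(X, y):
    # Grow a subset 'store' such that 1-NN restricted to X[store]
    # classifies every training instance correctly (Hart-style condensation).
    store = [0]                        # start from an arbitrary instance
    changed = True
    while changed:                     # repeat passes until nothing is added
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(dists))]
            if y[nearest] != y[i]:     # misclassified -> absorb the instance
                store.append(i)
                changed = True
    return store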
The reduced NN rule proposed by Gates [15] tries to overcome this drawback by
searching within Hart's consistent set for the minimal subset that still correctly
classifies all the learning instances. However, this approach succeeds if and only if
Hart's consistent set contains the minimal consistent subset of the learning set,
which is not always the case. It is worth remarking that both of these approaches
produce a single instance subset: the user cannot retain more or fewer instances or
otherwise control the size of the subset.
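A sketch of this contraction step, under the same illustrative assumptions, is shown below: starting from a condensed set, each instance is tentatively removed, and the removal is kept only when the remaining subset still classifies every learning instance correctly.

import numpy as np

def consistent(X, y, subset):
    # True if 1-NN restricted to X[subset] classifies every instance of X correctly.
    for i in range(len(X)):
        dists = np.linalg.norm(X[subset] - X[i], axis=1)
        if y[subset[int(np.argmin(dists))]] != y[i]:
            return False
    return True

def reduced_nn(X, y, condensed_subset):
    subset = list(condensed_subset)
    for idx in list(subset):                      # try to drop each instance in turn
        trial = [j for j in subset if j != idx]
        if trial and consistent(X, y, trial):     # keep the removal if consistency holds
            subset = trial
    return subset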
Wilson [38] proposes edited NN and repeated edited NN. Edited NN removes
noisy instances as well as close border cases, leaving smoother decision boundaries.
However, it retains all internal points, so it does not reduce the storage
requirements as much as most other reduction algorithms. Repeated edited NN
continues to widen the gap between classes and to smooth the decision boundary.
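A minimal sketch of the editing step is given below, assuming k = 3 and Euclidean distance (illustrative choices only): an instance is discarded when the majority of its k nearest neighbors disagree with its class label. Repeated edited NN simply re-applies this step to the surviving instances until no further instance is removed.

from collections import Counter
import numpy as np

def edited_nn(X, y, k=3):
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]          # skip the point itself
        majority = Counter(y[neighbours]).most_common(1)[0][0]
        if majority == y[i]:                             # neighbours agree -> keep it
            keep.append(i)
    return keep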
Kibler and Aha [20] propose the shrink algorithm, which is similar to the
reduced NN. It retains border points but, unlike the reduced NN, it is sensitive
to noise.
Brodley [7] proposes the MCS (model class selection) system to deal with the
IS problem; MCS tends to avoid retaining noisy instances.
Wilson and Martinez [39] present a family of editing algorithms guided by two
sets defined for each instance: its k nearest neighbors and its associates. An
associate of an instance P is any instance that has P among its k nearest
neighbors. The family is composed of five algorithms, DROP1 to DROP5. The
first three methods, DROP1-3, were previously introduced by the authors under
the names RT1-3, respectively. These algorithms aim to produce instance
reductions that provide noise tolerance, high generalization accuracy,
insensitivity to the order of presentation of instances, and significant storage
reduction, which in turn improves the generalization and speed.
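To make the two sets concrete, the sketch below shows one way the neighbor and associate lists could be built (Euclidean distance and k = 3 are assumed for illustration); the closing comment paraphrases the flavor of the DROP removal criterion rather than reproducing any particular variant.

import numpy as np

def build_neighbours_and_associates(X, k=3):
    n = len(X)
    neighbours = []
    associates = [set() for _ in range(n)]
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        knn = np.argsort(dists)[1:k + 1]     # k nearest neighbours of instance i
        neighbours.append(knn)
        for q in knn:
            associates[q].add(i)             # i is an associate of each of its neighbours
    return neighbours, associates

# DROP-style rule (informally): an instance may be removed when at least as many
# of its associates are classified correctly without it as with it.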
5.2.2 Strategies for Instance Selection
In this section, we describe two strategies for IS followed in the study presented in
this chapter.