5.2 Instance Selection
In IS we want to isolate the smallest set of instances that enables us to predict the
class of a query instance with the same (or higher) accuracy as the original set [26].
Reducing the size of the "useful" data set can reduce both the space and time
complexities of subsequent processing phases. One can also hope to reduce the
size of the formulas obtained by a subsequent induction algorithm run on the
reduced, less noisy data set, which may facilitate interpretation tasks.
IS raises the problem of defining relevance for a prototype subset. From the
statistical viewpoint, relevance can be partly understood as a subset's contribution
to the overall accuracy obtained by a subsequent induction. We emphasize that
removing instances does not necessarily degrade the results: we have observed
experimentally that a small number of instances can achieve performance
comparable to that of the whole sample, and sometimes higher. Two reasons come
to mind to explain this observation. First, removing instances can eliminate noise
or repetitions in the data. Second, each instance can be viewed as a supplementary
degree of freedom, so reducing the number of instances can sometimes avoid
over-fitting. With this definition, there are two types of irrelevant instances we
should remove:
• The first are instances belonging to regions with very few elements; their vote
is statistically a poor estimator, and a little noise might affect their vote
dramatically. It is also common in statistical analysis to search for and remove
such points, e.g., in regression and parametric estimation. Removing them does
not necessarily bring a great reduction in the size of the retained instance set,
but it may be helpful for future prediction tasks.
• The second are instances belonging to regions where the votes are essentially
random: the local class densities are distributed approximately as the overall
class distribution. These instances are not necessarily harmful for prediction,
but if they are numerous, a great reduction in size can be obtained by removing
some of them. Depending on the IS method, all or some of them would be
removed (a simple filtering rule in this spirit is sketched below).
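To make these two removal criteria concrete, the following is a minimal sketch of an
editing-style filter in the spirit of Wilson's edited nearest neighbor rule: an instance
is discarded when the majority vote of its k nearest neighbors disagrees with its own
class. This is only an illustration under our own assumptions, not the specific IS
method of this chapter; the function name wilson_edit and the default k = 3 are ours.

import numpy as np

def wilson_edit(X, y, k=3):
    """Keep only instances whose class agrees with the majority vote
    of their k nearest neighbors (computed leave-one-out)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        # Euclidean distances from instance i to every other instance
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                     # exclude the instance itself
        neighbors = np.argsort(d)[:k]     # indices of the k nearest neighbors
        labels, counts = np.unique(y[neighbors], return_counts=True)
        # Instances in sparse or class-mixed regions tend to lose this vote
        keep[i] = labels[np.argmax(counts)] == y[i]
    return X[keep], y[keep]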
5.2.1 Related Work
Historically, IS was first aimed at improving the efficiency of the nearest
neighbor (NN) classifier [9]. The NN algorithm is one of the most venerable
algorithms in machine learning [10]. To classify a new instance, the (possibly
weighted) Euclidean distance is computed between this instance and each training
instance, and the new instance is assigned the class of its nearest neighbor. More
generally, the k nearest neighbors (k-NN) are computed, and the new instance is
assigned the class that is most frequent among these k neighbors. The use of the
k-NN classifier was also spread and encouraged by early theoretical results
linking its generalization error to the Bayes error.
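As an illustration of the rule just described, here is a minimal unweighted k-NN
classifier; the function name knn_predict and its interface are our own choices,
not taken from [9] or [10].

import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Assign x_query the most frequent class among its k nearest
    training instances under the (unweighted) Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training instance
    d = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    neighbors = np.argsort(d)[:k]         # k closest training instances
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return labels[np.argmax(counts)]      # majority vote (k = 1 gives plain NN)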