5.2 Instance Selection
In IS we want to isolate the smallest set of instances that enables us to predict the
class of a query instance with the same (or higher) accuracy as the original set [26].
Reducing the size of the "useful" data set can reduce both the space and time
complexities of subsequent processing phases. One can also hope to reduce the
size of the formulas obtained by a subsequent induction algorithm run on the
reduced, less noisy data set, which may facilitate interpretation tasks.
IS raises the problem of defining relevance for a prototype subset. From the
statistical viewpoint, relevance can be partly understood as a subset's contribution
to the overall accuracy obtained by a subsequent induction. We emphasize that
removing instances does not necessarily degrade the results: we have observed
experimentally that a small number of instances can achieve performance
comparable to that of the whole sample, and sometimes higher. Two reasons come
to mind to explain this observation. First, removing instances can eliminate noise
or repetitions in the data. Second, each instance can be viewed as a supplementary
degree of freedom, so reducing the number of instances can sometimes avoid
over-fitting. With this definition, there are two types of irrelevant instances we
should remove:
• The first are instances belonging to regions with very few elements; their vote
is statistically a poor estimator, and a little noise might affect their vote
dramatically. It is also common in statistical analysis to search for and remove
such points, e.g., in regression and parametric estimation. Removing them does
not necessarily bring a great reduction in the size of the retained instance set,
but it may be helpful for future prediction tasks.
• The second are instances belonging to regions where the votes are essentially
random: the local class densities are distributed approximately as the overall
class distribution. These instances are not necessarily harmful for prediction,
but if they are numerous, a great reduction in size can be obtained by removing
some of them. Depending on the IS method, all or some of them would be
removed (a simple filtering rule in this spirit is sketched below).
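To make these two removal criteria concrete, the following is a minimal sketch of an
editing-style filter in the spirit of Wilson's edited nearest neighbor rule: an instance
is discarded when the majority vote of its k nearest neighbors disagrees with its own
class. This is only an illustration under our own assumptions, not the specific IS
method of this chapter; the function name wilson_edit and the default k = 3 are ours.

import numpy as np

def wilson_edit(X, y, k=3):
    """Keep only instances whose class agrees with the majority vote
    of their k nearest neighbors (computed leave-one-out)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        # Euclidean distances from instance i to every other instance
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                     # exclude the instance itself
        neighbors = np.argsort(d)[:k]     # indices of the k nearest neighbors
        labels, counts = np.unique(y[neighbors], return_counts=True)
        # Instances in sparse or class-mixed regions tend to lose this vote
        keep[i] = labels[np.argmax(counts)] == y[i]
    return X[keep], y[keep]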
5.2.1 Related Work
Historically, IS was first aimed at improving the efficiency of the nearest
neighbor (NN) classifier [9]. The NN algorithm is one of the most venerable
algorithms in machine learning [10]. To classify a new instance, the (possibly
weighted) Euclidean distance is computed between this instance and each training
instance, and the new instance is assigned the class of its nearest neighbor. More
generally, the k nearest neighbors (k-NN) are computed, and the new instance is
assigned the class that is most frequent among these k neighbors. The use of the
k-NN classifier was also spread and encouraged by early theoretical results
linking its generalization error to the Bayes error.
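As an illustration of the rule just described, here is a minimal unweighted k-NN
classifier; the function name knn_predict and its interface are our own choices,
not taken from [9] or [10].

import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Assign x_query the most frequent class among its k nearest
    training instances under the (unweighted) Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training instance
    d = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    neighbors = np.argsort(d)[:k]         # k closest training instances
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return labels[np.argmax(counts)]      # majority vote (k = 1 gives plain NN)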