8.3.1.2 Type of Selection
This factor is mainly conditioned by the type of search carried out by the PS algorithms: whether they seek to retain border points, central points, or some other set of points.
Condensation: This set includes the techniques which aim to retain the points closest to the decision boundaries, also called border points. The intuition behind retaining border points is that internal points do not affect the decision boundaries as much as border points do, and thus can be removed with relatively little effect on classification. The idea is to preserve the accuracy over the training set, although the generalization accuracy over the test set can be negatively affected. Nevertheless, the reduction capability of condensation methods is normally high, because there are fewer border points than internal points in most data.
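As a concrete illustration, the following is a minimal Python sketch of a classic condensation technique, Hart's condensed nearest neighbor (CNN) rule. The function name condense, the random presentation order, and the array-based data layout are choices made here for clarity, not part of the original formulation.

```python
import numpy as np

def condense(X, y, seed=0):
    """Sketch of Hart's CNN rule: keep every instance that the
    current subset misclassifies under the 1NN rule."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # presentation order (a choice made here)
    S = [order[0]]                       # start the subset with a single instance
    changed = True
    while changed:                       # repeat until a full pass adds nothing
        changed = False
        for i in order:
            if i in S:
                continue
            d = np.linalg.norm(X[S] - X[i], axis=1)   # distances to the subset
            if y[S][np.argmin(d)] != y[i]:            # 1NN prediction wrong?
                S.append(i)              # border-like point: must be kept
                changed = True
    return np.array(S)                   # indices of the retained instances
```

Note that instances far from the boundaries are classified correctly by the growing subset and are therefore never added, which is exactly why condensation yields high reduction rates.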
Edition: These algorithms instead seek to remove border points. They eliminate instances that are noisy or that disagree with their neighbors, which leaves smoother decision boundaries behind. However, such algorithms retain internal points, even those that do not necessarily contribute to the decision boundaries. The effect obtained is an improvement in generalization accuracy on test data, although the reduction rate achieved is low.
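A well-known edition technique is Wilson's edited nearest neighbor (ENN) rule; the sketch below, with the illustrative name edit, removes every instance whose class label disagrees with the majority of its k nearest neighbors. Integer class labels are assumed so that np.bincount can tally the votes.

```python
import numpy as np

def edit(X, y, k=3):
    """Sketch of Wilson's ENN rule: drop every instance whose label
    disagrees with the majority of its k nearest neighbors."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]        # skip the instance itself
        votes = np.bincount(y[neighbors], minlength=y.max() + 1)
        if votes.argmax() == y[i]:                # neighborhood agrees: keep
            keep.append(i)
    return np.array(keep)                         # indices of retained instances
```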
Hybrid: Hybrid methods try to find the smallest subset S which maintains or even increases the generalization accuracy on test data. To achieve this, they allow the removal of both internal and border points, based on the criteria followed by the two previous strategies. The kNN classifier is highly adaptable to these methods, obtaining great improvements even with a very small subset of selected instances.
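The sketch below illustrates one simple way to combine the two criteria, applying edition first and condensation afterwards. This particular composition is an illustration of the hybrid idea rather than a specific published algorithm, and it reuses the edit and condense sketches above.

```python
def hybrid_select(X, y, k=3, seed=0):
    """Illustrative hybrid: edition (ENN) first smooths the boundaries,
    then condensation (CNN) removes redundant internal points."""
    kept = edit(X, y, k=k)                    # indices surviving edition
    condensed = condense(X[kept], y[kept], seed=seed)
    return kept[condensed]                    # map back to original indices
```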
8.3.1.3 Evaluation of Search
kNN is a simple technique and it can be used to direct the search of a PS algorithm. The objective pursued is to make a prediction about a non-definitive selection and to compare candidate selections. This characteristic influences the quality criterion, and it can be divided into:
Filter: The kNN rule is used on partial data to determine the criterion for adding or removing instances, and no leave-one-out validation scheme is used to obtain a good estimation of generalization accuracy. Using subsets of the training data in each decision increases the efficiency of these methods, but the accuracy may not be enhanced.
Wrapper: The kNN rule is used on the complete training set with the leave-one-out validation scheme. The combined use of these two factors yields a good estimation of generalization accuracy, which helps to obtain better accuracy over test data. However, each decision involves a complete computation of the kNN rule over the training set, so the learning phase can be computationally expensive.
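To make the contrast concrete, a wrapper-style quality criterion can be sketched as leave-one-out 1NN accuracy over the complete training set; the function name loo_fitness and the choice of k = 1 are assumptions made here for illustration. A filter method would instead base each decision on a partial neighborhood computation and skip this full leave-one-out loop.

```python
import numpy as np

def loo_fitness(X, y, subset):
    """Wrapper-style criterion: leave-one-out 1NN accuracy of a
    candidate subset, measured over the complete training set."""
    correct = 0
    for i in range(len(X)):
        S = [j for j in subset if j != i]    # leave instance i out of the subset
        if not S:
            continue
        d = np.linalg.norm(X[S] - X[i], axis=1)
        correct += y[S][np.argmin(d)] == y[i]
    return correct / len(X)
```

Because this score is recomputed for every candidate decision, its cost grows with the size of the training set, which is the computational drawback noted above.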