Table 8.6 IS and data complexity

Description                                                                     Reference
Data characterization for effective edition and condensation schemes           [119]
Data characterization for effective PS                                         [64]
Use of PS to enhance the computation of data complexity measures               [96]
Data characterization for effective under-sampling and over-sampling in        [115]
  imbalanced problems
Meta-learning framework for IS                                                  [103]
Prediction of noise filtering efficacy with data complexity measures for KNN   [137]
conditions were discussed in Chap. 2 of this book. The data sets used are summarized
in Table 8.7.
The data sets considered are partitioned using the 10-FCV procedure. The parameters of the PS algorithms are those recommended by their respective authors, under the assumption that these recommended values were chosen optimally. In the PS methods that require the number of neighbors as a parameter, its value matches the k value of the KNN rule used afterwards, except that all edition methods operate with a minimum of 3 nearest neighbors (as recommended in [165]), even though they are applied to a 1NN classifier. The Euclidean distance is chosen as the distance metric because it is well known and the most widely used for KNN. All probabilistic methods (including incremental methods, which depend on the order of instance presentation) are run three times, and the final results reported correspond to the average performance over these runs.
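As a concrete illustration of this protocol, the sketch below evaluates a generic PS method with a 1NN classifier and the Euclidean distance. The names ps_method and folds are hypothetical placeholders for a PS algorithm and the 10-FCV partitions; this is a minimal sketch of the experimental protocol, not the KEEL implementation.

import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    # Plain KNN with Euclidean distance: majority label among the
    # k nearest training instances (inputs are NumPy arrays)
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def evaluate_ps(ps_method, folds, runs=3):
    # Average test accuracy and reduction rate over the 10-FCV folds.
    # Probabilistic or incremental PS methods are repeated `runs` times
    # per fold; deterministic methods would use runs=1.
    accs, reds = [], []
    for X_tr, y_tr, X_te, y_te in folds:
        for _ in range(runs):
            S_X, S_y = ps_method(X_tr, y_tr)   # selected subset S of the training set
            preds = np.array([knn_predict(S_X, S_y, x, k=1) for x in X_te])
            accs.append(np.mean(preds == y_te))
            reds.append(1.0 - len(S_X) / len(X_tr))
    return float(np.mean(accs)), float(np.mean(reds))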
Thus, the empirical study involves 42 PS methods from those listed in Table 8.1. We would like to point out that the implementations are based only on the descriptions and specifications given by the respective authors in their papers. No advanced data structures or enhancements to improve the efficiency of the PS methods have been applied. All methods (including the slowest ones) are included in the KEEL software [3].
8.6.1 Analysis and Empirical Results on Small Size Data Sets
Table 8.8 presents the average results obtained by the PS methods over the 39 small size data sets. Red denotes the reduction rate achieved; tst Acc and tst Kap denote the accuracy and kappa obtained on test data, respectively; Acc∗Red and Kap∗Red correspond to the product of accuracy/kappa and reduction rate, which is an estimator of how good a PS method is considering a tradeoff between reduction and classification success rate. Finally, Time denotes the average time elapsed in seconds to complete a run of a PS method.¹ In the case of 1NN, the time required is not displayed because no PS stage is run beforehand. For each type of result, the algorithms are ordered from best to worst. Algorithms highlighted in bold are those which obtain the best result in terms of Acc∗Red and Kap∗Red.
¹ The machine used was an Intel Core i7 CPU 920 at 2.67 GHz with 4 GB of RAM.
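To make these tradeoff estimators concrete, the sketch below computes kappa from true and predicted labels and then forms the Acc∗Red and Kap∗Red products. All numeric values are illustrative placeholders, not results from Table 8.8.

import numpy as np

def cohen_kappa(y_true, y_pred):
    # Cohen's kappa: observed agreement corrected for agreement by chance
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)                        # observed agreement
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)  # chance agreement
              for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Illustrative values for one hypothetical PS method
red = 0.80       # fraction of training instances removed by PS
tst_acc = 0.75   # 1NN test accuracy using the selected subset
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
tst_kap = cohen_kappa(y_true, y_pred)

acc_red = tst_acc * red   # Acc*Red tradeoff estimator
kap_red = tst_kap * red   # Kap*Red tradeoff estimator
print(acc_red, kap_red)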