Instance Selection - Data Preprocessing in Data Mining

pursued is to study the effect of scaling up the data in PS methods. Table 8.10 shows the average results obtained for the distinct performance measures considered (it follows the same format as Table 8.8), and Table 8.11 summarizes the Wilcoxon test results over medium data sets.
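As an illustration of the kind of pairwise comparison that a Wilcoxon test summarizes, the sketch below computes the signed-rank sums for two methods evaluated on the same data sets. It is a minimal sketch and not necessarily the exact procedure behind Table 8.11: the function name `wilcoxon_signed_ranks`, the policy of discarding zero differences, and the accuracy values are all assumptions made for illustration.

```python
def wilcoxon_signed_ranks(scores_a, scores_b):
    """Return (R_plus, R_minus) for paired scores of two methods.

    Assumptions of this sketch: zero differences are discarded, and tied
    absolute differences share the average of the ranks they span.
    """
    # Round to avoid spurious floating-point mismatches between equal gaps.
    diffs = [round(a - b, 10) for a, b in zip(scores_a, scores_b)]
    diffs = [d for d in diffs if d != 0]

    # Indices of the differences, ordered by absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)

    i = 0
    while i < len(order):
        # Extend j over the run of differences tied with position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Hypothetical accuracies of a PS method and of 1NN over five data sets.
ps_method = [0.80, 0.75, 0.90, 0.65, 0.70]
one_nn = [0.78, 0.76, 0.85, 0.60, 0.70]
r_plus, r_minus = wilcoxon_signed_ranks(ps_method, one_nn)
```

The smaller of the two sums is the test statistic. It is compared against a critical value, or converted to a p-value, to decide whether one method is significantly better than the other.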

We can analyze several details from the results collected in Tables 8.10 and 8.11:

• Five techniques outperform 1NN in terms of accuracy/kappa over medium data sets: RMHC, SSMA, HMNEI, MoCS and RNGE. Two of them are edition schemes (MoCS and RNGE) and the rest are hybrid schemes. Again, no condensation method is more accurate than 1NN.

• Some methods behave clearly differently when dealing with larger data sets. This is the case with AllKNN, MENN and CHC. The first two tend to attempt further reduction passes during the edition process, which works against accuracy and kappa, and this effect is more noticeable in medium-size problems. Furthermore, CHC loses the balance between reduction and accuracy when the data size increases, because the reduction objective becomes easier to achieve.

• There are some techniques whose run time could be prohibitive when the data scales up. This is the case for RNN, RMHC, CHC and SSMA.

• The best methods in terms of accuracy or kappa are RNGE and HMNEI.

• The best methods considering the tradeoff between reduction and accuracy/kappa are RMHC, RNN and SSMA.
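The measures referred to in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the labels and subset sizes are invented, and scoring the tradeoff as the product of reduction rate and accuracy (or kappa) is an assumption about how the two objectives are combined.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance.

    Undefined (division by zero) when chance agreement equals 1, e.g. when
    both label lists contain a single identical class.
    """
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def reduction_rate(n_original, n_selected):
    """Fraction of the training set discarded by the PS method."""
    return 1 - n_selected / n_original


# Hypothetical test labels and predictions of 1NN using a selected subset.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
red = reduction_rate(n_original=1000, n_selected=100)

# Assumed tradeoff scores: higher means better on both objectives at once.
acc_red = acc * red
kappa_red = kappa * red
```

Under a product of this kind, a method that keeps accuracy but removes few instances scores low, which is consistent with edition schemes being absent from the last bullet.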

8.6.3 Global View of the Obtained Results

Given the results obtained, several PS methods can be highlighted according to the accuracy/kappa obtained (RMHC, SSMA, HMNEI, RNGE), the reduction rate achieved (SSMA, RNN, CCIS) and the computational cost required (POP, FCNN). However, we want to remark that the choice of a certain method depends on various factors, and the results are offered here with the intention of being useful in making this decision. For example, an edition scheme will usually outperform the standard kNN classifier in the presence of noise, but few instances will be removed. This fact could determine whether or not the method is suitable for larger data sets, taking into account the expected size of the resulting subset. We have seen that the PS methods which allow high reduction rates while preserving accuracy are usually the slowest ones (hybrid mixed approaches such as SSMA); they may require an advanced mechanism to be applied over large data sets, or they may even be useless under these circumstances. The fast methods that achieve high reduction rates are the condensation approaches, but we have seen that they are not able to improve on kNN in terms of accuracy. In short, each method has advantages and disadvantages, and the results offered in this section allow an informed decision to be made within each category.
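To make the remark about edition schemes concrete, the sketch below applies a simple Wilson-style edited nearest neighbor (ENN) filter to a toy data set. ENN is used here only as a representative edition scheme and is not one of the methods ranked in this section; the helper names, the toy data and the choice of k = 3 are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, query, k=1, skip=None):
    """Majority label among the k nearest training instances to `query`.

    `train` is a list of (features, label) pairs; `skip` optionally excludes
    one training index, which gives leave-one-out behaviour.
    """
    neighbours = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), i)
        for i, (x, _) in enumerate(train)
        if i != skip
    )[:k]
    votes = Counter(train[i][1] for _, i in neighbours)
    return votes.most_common(1)[0][0]


def enn_edit(train, k=3):
    """Keep only the instances that their own k nearest neighbours classify correctly."""
    return [
        (x, y)
        for i, (x, y) in enumerate(train)
        if knn_predict(train, x, k=k, skip=i) == y
    ]


# Toy 1-D data: two clean clusters plus one noisy 'b' inside the 'a' cluster.
train = [
    ((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"), ((0.3,), "a"),
    ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b"), ((1.3,), "b"),
    ((0.15,), "b"),  # noisy instance
]
edited = enn_edit(train, k=3)

# A query next to the noisy point: 1NN is misled before editing only.
before = knn_predict(train, (0.16,), k=1)
after = knn_predict(edited, (0.16,), k=1)
```

In this toy case only the single noisy instance is removed, so the 1NN decision near it changes while the subset remains almost as large as the original set, which mirrors the behaviour described for edition schemes.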

In summary, and focusing on the objectives usually considered in the use of PS algorithms, we can suggest the following guidelines for choosing the proper PS algorithm:
