Instance Selection Using Evolutionary Algorithms: An Experimental Study - Advanced Techniques in Knowledge Discovery and Data Mining


10. For each pair $i$, the test set, $t_i$, is $D_i$, and the training set, $T_i$, is the union of all the other $D_j$, $j \neq i$ (clearly, $D = T_i \cup t_i$ and $T_i \cap t_i = \emptyset$).

Ten trials were run for each data set and IS algorithm. During the $i$th trial, the algorithm is applied to $T_i$, and then the resulting reduced set is used by the 1-NN algorithm for classifying the elements of $t_i$, obtaining a test accuracy.
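The trial procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names (`cross_validate`, `one_nn_predict`, `keep_all`), the random assignment of instances to folds, and the squared Euclidean distance are all assumptions made for the example. `keep_all` is a hypothetical placeholder standing in for any IS algorithm.

```python
import random

def one_nn_predict(train, x):
    """Return the label of the training instance closest to x (squared Euclidean)."""
    nearest = min(train, key=lambda inst: sum((a - b) ** 2 for a, b in zip(inst[0], x)))
    return nearest[1]

def keep_all(train):
    """Hypothetical placeholder IS algorithm: selects every instance."""
    return list(train)

def cross_validate(data, select_instances=keep_all, folds=10, seed=0):
    """Average 1-NN test accuracy over `folds` trials.

    data: list of (feature_tuple, label) pairs.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    # D_1, ..., D_folds: disjoint subsets whose union is D (random split assumed).
    parts = [shuffled[i::folds] for i in range(folds)]
    accuracies = []
    for i in range(folds):
        test = parts[i]  # t_i = D_i
        # T_i = union of all D_j with j != i
        train = [inst for j, part in enumerate(parts) if j != i for inst in part]
        reduced = select_instances(train)  # IS sees the training set only
        hits = sum(one_nn_predict(reduced, x) == y for x, y in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / folds
```

Passing a different `select_instances` function swaps in another IS algorithm while the test folds stay untouched, which is what keeps the reported accuracy a genuine test accuracy.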

5.6.2.2 Instance Selection - Training Set Selection

We followed the stratified approach for IS-TSS shown in Figure 5.3 to carry out the experiments on the application of the IS algorithms to TSS. In particular, for each data set, $D$, two partitions are randomly made, each consisting of two nonoverlapping sets with 50% of the elements: $D = T_{11} \cup T_{12}$ and $D = T_{21} \cup T_{22}$. The IS algorithms are applied to these sets, returning four sets with a reduced number of instances: $S_{11}$, $S_{12}$, $S_{21}$, and $S_{22}$. Then two different training sets are calculated:

$$S_1 = S_{11} \cup S_{12} \quad \text{and} \quad S_2 = S_{21} \cup S_{22}. \tag{5.4}$$

Their associated test sets are $s_i = D \setminus S_i$, $i = 1, 2$. The training sets are used during the IS process, while the test sets are used to calculate the test accuracy of the model learned. To determine the quality of the training sets obtained, two learning algorithms, the classical 1-NN classifier and C4.5 [31], were used on these sets.
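As a rough illustration of how the two training/test pairs are assembled, the following Python sketch builds $S_1$, $S_2$ and their test sets from a data set. The helper names (`halve`, `tss_training_and_test_sets`) are hypothetical, the IS algorithm is passed in as an arbitrary function, and instances are assumed to be distinct, hashable tuples so that the set difference is well defined. The split is a plain random 50/50 split, as the text describes; nothing here balances classes.

```python
import random

def halve(data, rng):
    """Randomly split data into two disjoint halves (T_k1, T_k2)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def tss_training_and_test_sets(data, select_instances, seed=0):
    """Return [(S_1, s_1), (S_2, s_2)], where s_k is D minus S_k."""
    rng = random.Random(seed)
    result = []
    for _ in range(2):  # two independent random partitions of D
        first, second = halve(data, rng)  # D = T_k1 union T_k2
        # S_k = S_k1 union S_k2: IS is applied to each half separately
        selected = select_instances(first) + select_instances(second)
        chosen = set(selected)
        # s_k: every instance of D that was not selected
        test = [inst for inst in data if inst not in chosen]
        result.append((selected, test))
    return result
```

Each returned pair can then be handed to a learner such as 1-NN or C4.5: train on the first element, measure accuracy on the second.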

5.6.3 Algorithms and Parameters

5.6.3.1 Instance Selection - Prototype Selection

We have executed the following classical IS algorithms: CNN, ENN, RENN, MCS, Shrink, and Drop1-3. Moreover, we have carried out experiments with a 1-NN classifier that considers all instances in the training sets.

The parameters used for EAs are:

• GGA uses a population of 10 chromosomes. The crossover rate is 1, and two mutation rates were used: 0.01 for changing a 1 to a 0, and 0.001 for the opposite change. This asymmetry in mutation rates is intended to favor the presence in the population of solutions with few instances, which is a desirable feature. GGA was run for 1000 generations.

• SGA employs the same parameters but performs 10000 offspring evaluations.

• The population size of the CHC algorithm was 10 chromosomes, and it was run for 1000 generations.

• The parameters associated with PBIL were: $N_{samples} = 10$, $LR = 0.005$, $P_m = 0.01$, and Mut_Shif = 0.01; 1000 iterations of this algorithm were completed.
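The asymmetric mutation used by GGA can be illustrated with a short Python sketch. It assumes a binary chromosome in which a gene value of 1 means the corresponding instance is kept in the selected subset; the function and constant names are hypothetical, and only the two rates come from the text.

```python
import random

# Mutation rates quoted in the text for GGA.
P_ONE_TO_ZERO = 0.01   # probability of dropping a selected instance
P_ZERO_TO_ONE = 0.001  # probability of adding an unselected instance

def mutate(chromosome, rng, p10=P_ONE_TO_ZERO, p01=P_ZERO_TO_ONE):
    """Return a mutated copy of a binary chromosome (1 = instance kept, assumed encoding)."""
    child = []
    for gene in chromosome:
        # The flip probability depends on the gene's current value.
        rate = p10 if gene == 1 else p01
        child.append(1 - gene if rng.random() < rate else gene)
    return child
```

Because a 1 is ten times more likely to flip than a 0, repeated mutation pushes chromosomes toward fewer selected instances, which is the bias the text describes as desirable.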
