Instance Selection Using Evolutionary Algorithms: An Experimental Study - Advanced Techniques in Knowledge Discovery and Data Mining

Table 5.2. Data sets for IS-TSS.

Data set                 Num. instances   Num. features   Num. classes
Pen-based recognition    10992            16              10
SatImage                 6435             36              6
Thyroid                  7200             21              3

5.6.1.2 Instance Selection - Training Set Selection

To adequately study the behavior of the IS algorithms on TSS, we need data sets
with more instances than those in Table 5.1. We have therefore chosen three
databases containing more than 6,000 and up to 11,000 instances, which allows
us to analyze how the IS algorithms scale up. They are shown in Table 5.2.

Pen-Based Recognition: A digit database was created by collecting 250 samples
from each of 44 writers. A WACOM PL-100V pressure-sensitive tablet with an
integrated LCD display and a cordless stylus was used; its input and display
areas are located in the same place. Attached to the serial port of an Intel
486-based PC, the tablet allowed handwriting samples to be collected. The
writers were asked to write 250 digits in random order inside boxes of
500-by-500 tablet pixel resolution. The raw data captured from the tablet
consist of integer values between 0 and 500.

SatImage: The database consists of the multispectral values of pixels in 3×3
neighborhoods in a satellite image, and the classification associated with the
central pixel in each neighborhood. The aim is to predict this classification,
given the multispectral values. In the sample database, the class of a pixel is
coded as a number.

Thyroid: The aim is to determine whether a patient referred to the clinic is
hypothyroid. Three classes are therefore defined: normal (not hypothyroid),
hyperfunction, and subnormal functioning.

5.6.2 Partitions

Because IS-PS and IS-TSS follow different strategies, we use a different
partitioning model for each.

5.6.2.1 Instance Selection - Prototype Selection

The sets considered for IS-PS are partitioned using the ten-fold
cross-validation procedure. Each data set, D, is randomly divided into ten
disjoint sets of equal size, D1, …, D10. We then construct ten pairs of
training and test sets, (Ti, ti), i = 1, …,
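The partitioning step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code. The excerpt breaks off before it defines how each pair (Ti, ti) is built, so the sketch assumes the usual ten-fold convention: the test set ti is the fold Di and the training set Ti is the union of the remaining folds. The function name ten_fold_pairs and its parameters are hypothetical.

```python
import random


def ten_fold_pairs(n_instances, n_folds=10, seed=0):
    """Return n_folds (train, test) pairs of instance indices.

    Assumption (not stated in the excerpt): for pair i the test set is
    fold D_i and the training set is the union of all other folds.
    """
    indices = list(range(n_instances))
    # Shuffle so that the division of D into folds is random.
    random.Random(seed).shuffle(indices)

    # Strided slicing gives disjoint folds whose sizes differ by at most one
    # (exactly equal when n_instances is a multiple of n_folds).
    folds = [indices[k::n_folds] for k in range(n_folds)]

    pairs = []
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs


if __name__ == "__main__":
    # Hypothetical data set of 100 instances.
    for train, test in ten_fold_pairs(100):
        print(len(train), len(test))
```

Working with index lists rather than the instances themselves keeps the sketch independent of how a data set is stored; the same pairs can then be used to slice any array-like container.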
