organization, business environments or nature evolution [109, 133]. It is conceived of as a real necessity in the world around us, and thus also in DM. Many circumstances lead to performing data selection, as we have enumerated previously: data is not pure and initially not prepared for DM; there is missing data; there is irrelevant and redundant data; errors are likely to occur during collection or storage; and the data may be too overwhelming to manage.
The aim of IS is to choose a subset of the data that achieves the original purpose of a DM application as if the whole data set were used [42, 127]. However, from our point of view, data reduction by means of data subset selection is not always IS. We identify IS with an intelligent operation of instance categorization, according to a degree of irrelevance or noise and depending on the DM task. For this reason, for example, we do not consider data sampling as IS per se, because it has a more general purpose: its underlying aim is to reduce the data randomly in order to enhance later learning tasks. Nevertheless, data sampling [49] also belongs to the data reduction family of methods and was mentioned in Chap. 6 of this book.
The optimal outcome of IS is a minimum, model-independent data subset that can accomplish the same task with no performance loss. Thus, P(DM_s) = P(DM_t), where P is the performance, DM is the DM algorithm, s is the subset of instances selected and t is the complete or training set of instances. According to Liu [109], IS has the following outstanding functions:
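As a minimal illustration (not taken from the text), the condition P(DM_s) = P(DM_t) can be checked mechanically by training the same DM algorithm on s and on t and comparing performance. The sketch below uses a toy 1-NN rule as a stand-in for the DM algorithm; the data points, the test set and the hand-picked subset s are all hypothetical.

```python
import math

def nn_classify(train, query):
    # 1-NN rule: return the label of the closest training instance.
    best = min(train, key=lambda inst: math.dist(inst[0], query))
    return best[1]

def performance(train, test):
    # P(DM_X): accuracy of the 1-NN rule trained on `train`, scored on `test`.
    hits = sum(nn_classify(train, x) == y for x, y in test)
    return hits / len(test)

# Toy data: (features, label) pairs; `t` plays the role of the complete set.
t = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
test = [((0.05, 0.1), "a"), ((0.95, 1.05), "b")]

# A hand-picked subset `s` keeping one prototype per class.
s = [t[0], t[2]]

print(performance(s, test) == performance(t, test))  # → True on this toy data
```

Here half of t can be discarded with no performance loss; real IS methods automate the choice of s.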
Enabling : IS makes the impossible possible. When the data set is too huge, it may not be possible to run a DM algorithm, or the DM task may not be performed effectively. IS enables a DM algorithm to work with huge data.
Focusing : The data may contain information about almost everything in a domain, but a concrete DM task is focused on only one aspect of interest of the domain. IS focuses the data on the relevant part.
Cleaning : By selecting relevant instances, redundant as well as noisy instances are usually removed, improving the quality of the input data and, hence, the expected DM performance.
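The cleaning function can be illustrated with a Wilson-style editing rule (ENN), a classical IS method of this kind: an instance is kept only if the majority label of its k nearest neighbours agrees with its own label, so mislabelled or noisy points are dropped. The sketch below runs on hypothetical toy data with one deliberately mislabelled point.

```python
import math
from collections import Counter

def knn_label(data, query, k=3):
    # Majority label among the k nearest neighbours of `query` in `data`.
    neighbours = sorted(data, key=lambda inst: math.dist(inst[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def wilson_editing(t, k=3):
    # Keep an instance only if its k nearest neighbours (excluding itself)
    # agree with its own label; noisy points are removed.
    edited = []
    for i, (x, y) in enumerate(t):
        rest = t[:i] + t[i + 1:]
        if knn_label(rest, x, k) == y:
            edited.append((x, y))
    return edited

# Two clean clusters plus one mislabelled ("b") point inside cluster "a".
t = [((0.0, 0.0), "a"), ((0.1, 0.1), "a"), ((0.2, 0.0), "a"),
     ((0.1, 0.0), "b"),                     # noise
     ((1.0, 1.0), "b"), ((1.1, 1.0), "b"), ((1.0, 1.1), "b")]

clean = wilson_editing(t, k=3)
print(len(clean))  # → 6: only the mislabelled point is removed
```

Chapter-specific editing methods are more elaborate, but they share this basic accept/reject decision per instance.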
In this chapter, we emphasize the importance of IS nowadays, since it is very common for databases to exceed the size of data that DM algorithms can properly handle. As another form of data reduction, it has recently attracted more and more attention from researchers and practitioners. Experience has shown that when a DM algorithm is applied to the reduced data set, it still achieves sufficient and suitable results, provided that the selection strategy has been well chosen for the situation at hand. That situation is conditioned by the learning task, the DM algorithm and the outcome expectations.
This book is especially oriented towards classification, thus we also focus the goal of an IS method on obtaining a subset S ⊆ T such that S does not contain superfluous instances and Acc(S) ≅ Acc(T), where Acc(X) is the classification accuracy obtained using X as a training set. Henceforth, S is used to denote the selected subset. As the training set is reduced, the runtime of the training process will also be reduced for the classifier, especially in instance-based or lazy learning methods [68].