organization, business environments or nature evolution [109, 133]. It is conceived of as a real necessity in the world around us, and thus also in DM. Many circumstances lead to performing data selection, as we have enumerated previously: data is not pure and initially not prepared for DM; there is missing data; there is irrelevant and redundant data; errors are likely to occur during collection or storage; and the data may be too overwhelming to manage.
The aim of IS is to choose a subset of the data that achieves the original purpose of a DM application as if the whole data set were used [42, 127]. However, from our point of view, data reduction by means of data subset selection is not always IS. We identify IS with an intelligent operation of instance categorization, according to a degree of irrelevance or noise and depending on the DM task. For this reason, for example, we do not consider data sampling as IS per se, because it has a more general purpose: its underlying aim is to reduce the data randomly in order to enhance later learning tasks. Nevertheless, data sampling [49] also belongs to the data reduction family of methods and was mentioned in Chap. 6 of this book.
The optimal outcome of IS is a minimum, model-independent data subset that can accomplish the same task with no performance loss. Thus, P(DM_s) = P(DM_t), where P is the performance, DM is the DM algorithm, s is the subset of instances selected and t is the complete or training set of instances. According to Liu [109], IS has the following outstanding functions:
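As a minimal illustration (not taken from the text), the condition P(DM_s) = P(DM_t) can be checked mechanically by training the same DM algorithm on s and on t and comparing performance. The sketch below uses a toy 1-NN rule as a stand-in for the DM algorithm; the data points, the test set and the hand-picked subset s are all hypothetical.

```python
import math

def nn_classify(train, query):
    # 1-NN rule: return the label of the closest training instance.
    best = min(train, key=lambda inst: math.dist(inst[0], query))
    return best[1]

def performance(train, test):
    # P(DM_X): accuracy of the 1-NN rule trained on `train`, scored on `test`.
    hits = sum(nn_classify(train, x) == y for x, y in test)
    return hits / len(test)

# Toy data: (features, label) pairs; `t` plays the role of the complete set.
t = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
test = [((0.05, 0.1), "a"), ((0.95, 1.05), "b")]

# A hand-picked subset `s` keeping one prototype per class.
s = [t[0], t[2]]

print(performance(s, test) == performance(t, test))  # → True on this toy data
```

Here half of t can be discarded with no performance loss; real IS methods automate the choice of s.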
Enabling : IS makes the impossible possible. When the data set is too huge, it may not be possible to run a DM algorithm, or the DM task may not be performed effectively. IS enables a DM algorithm to work with huge data.
Focusing : The data may contain information about almost everything in a domain, but a concrete DM task is focused on only one aspect of interest of the domain. IS focuses the data on the relevant part.
Cleaning : By selecting relevant instances, redundant as well as noisy instances are usually removed, improving the quality of the input data and, hence, the expected DM performance.
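The cleaning function can be illustrated with a Wilson-style editing rule (ENN), a classical IS method of this kind: an instance is kept only if the majority label of its k nearest neighbours agrees with its own label, so mislabelled or noisy points are dropped. The sketch below runs on hypothetical toy data with one deliberately mislabelled point.

```python
import math
from collections import Counter

def knn_label(data, query, k=3):
    # Majority label among the k nearest neighbours of `query` in `data`.
    neighbours = sorted(data, key=lambda inst: math.dist(inst[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def wilson_editing(t, k=3):
    # Keep an instance only if its k nearest neighbours (excluding itself)
    # agree with its own label; noisy points are removed.
    edited = []
    for i, (x, y) in enumerate(t):
        rest = t[:i] + t[i + 1:]
        if knn_label(rest, x, k) == y:
            edited.append((x, y))
    return edited

# Two clean clusters plus one mislabelled ("b") point inside cluster "a".
t = [((0.0, 0.0), "a"), ((0.1, 0.1), "a"), ((0.2, 0.0), "a"),
     ((0.1, 0.0), "b"),                     # noise
     ((1.0, 1.0), "b"), ((1.1, 1.0), "b"), ((1.0, 1.1), "b")]

clean = wilson_editing(t, k=3)
print(len(clean))  # → 6: only the mislabelled point is removed
```

Chapter-specific editing methods are more elaborate, but they share this basic accept/reject decision per instance.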
In this chapter, we emphasize the importance of IS nowadays, since it is very common for databases to exceed the size of data that DM algorithms can properly handle. As another form of data reduction, it has recently attracted more and more attention from researchers and practitioners. Experience has shown that when a DM algorithm is applied to the reduced data set, it still achieves sufficient and suitable results, provided that the selection strategy has been well chosen for the situation at hand. That situation is conditioned by the learning task, the DM algorithm and the outcome expectations.
This book is especially oriented towards classification, thus we also focus the goal of an IS method on obtaining a subset S ⊆ T such that S does not contain superfluous instances and Acc(S) ≅ Acc(T), where Acc(X) is the classification accuracy obtained using X as a training set. Henceforth, S is used to denote the selected subset. As the training set is reduced, the runtime of the training process will also be reduced for the classifier, especially in instance-based or lazy learning methods [68].