1.6.2.1 Feature Selection [19, 21]
Feature selection achieves the reduction of the data set by removing irrelevant or redundant features (or dimensions). The goal of FS is to find a minimum set of attributes such that the resulting probability distribution of the output attributes (or classes) is as close as possible to the original distribution obtained using all attributes. It facilitates the understanding of the extracted patterns and increases the speed of the learning stage.
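To make the idea concrete, the following is a minimal sketch (not tied to any particular FS method from the references) that keeps only the k attributes with the highest mutual information with the class; the iris data set and k = 2 are arbitrary stand-ins.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)                 # stand-in data set
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)          # data set with fewer dimensions

print("original attributes:", X.shape[1])         # 4
print("selected attributes:", X_reduced.shape[1]) # 2
print("kept attribute indices:", selector.get_support(indices=True))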
1.6.2.2 Instance Selection [14, 20]
Instance selection consists of choosing a subset of the total available data such that the original purpose of the DM application is achieved as if the whole data set had been used. It comprises the family of methods that choose, in a somewhat intelligent way, the best possible subset of examples from the original data by using rules and/or heuristics. The purely random selection of examples is usually known as sampling, and it is present in a very large number of DM models for conducting internal validation and for avoiding over-fitting; a small sketch of this case follows.
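The sketch below performs stratified random sampling (the sampling case just mentioned, not an intelligent selection heuristic), keeping an arbitrarily chosen 20% of the examples while preserving the class proportions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # stand-in data set
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)   # keep 20% of examples

print("original examples:", X.shape[0])      # 150
print("selected examples:", X_sub.shape[0])  # 30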
1.6.2.3 Discretization [15]
This procedure transforms quantitative data into qualitative data, that is, numerical attributes into discrete or nominal attributes with a finite number of intervals, obtaining a non-overlapping partition of a continuous domain. Each interval is then associated with a discrete numerical value. Once the discretization is performed, the data can be treated as nominal data during any DM process.
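As an illustration, the following is a minimal equal-width discretization sketch (one of the simplest possible schemes, chosen here only for clarity); the choice of 4 intervals is an arbitrary assumption, not a recommendation.

import numpy as np

def equal_width_discretize(values, n_intervals=4):
    # Build n_intervals non-overlapping intervals of equal width over the
    # observed range and return the interval index for each value.
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    return np.digitize(values, edges[1:-1])   # interior edges delimit the bins

ages = np.array([23.0, 31.5, 37.0, 45.2, 52.8, 61.0, 70.3])
print(equal_width_discretize(ages))           # [0 0 1 1 2 3 3]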
It is noteworthy that discretization is actually a hybrid data preprocessing technique involving both data preparation and data reduction tasks. Some sources include discretization in the data transformation category, while other sources consider it a data reduction process. In practice, discretization can be viewed as a data reduction method, since it maps data from a huge spectrum of numeric values to a greatly reduced set of discrete values. Our decision is to include it mostly in data reduction, although we also acknowledge the other view. The motivation behind this is that recent discretization schemes try to reduce the number of discrete intervals as much as possible while maintaining the performance of the subsequent DM process. In other words, it is often very easy to perform a basic discretization for any type of data, provided the data is suitable for a given algorithm, with a simple mapping between continuous and categorical values. The real difficulty, however, is to achieve a good reduction without compromising the quality of the data, and much of the effort expended by researchers follows this direction.