1.6.2.1 Feature Selection [19, 21]
Feature selection achieves the reduction of the data set by removing irrelevant or redundant features (or dimensions). The goal of FS is to find a minimum set of attributes such that the resulting probability distribution of the output attributes (or classes) is as close as possible to the original distribution obtained using all attributes. It facilitates the understanding of the extracted patterns and increases the speed of the learning stage.
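To make the idea concrete, the following is a minimal sketch (not tied to any particular FS method from the references) that keeps only the k attributes with the highest mutual information with the class; the iris data set and k = 2 are arbitrary stand-ins.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)                 # stand-in data set
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)          # data set with fewer dimensions

print("original attributes:", X.shape[1])         # 4
print("selected attributes:", X_reduced.shape[1]) # 2
print("kept attribute indices:", selector.get_support(indices=True))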
1.6.2.2 Instance Selection [14, 20]
Instance selection consists of choosing a subset of the total available data such that the original purpose of the DM application is achieved as if the whole data set had been used. It comprises the family of methods that choose, in a somewhat intelligent way, the best possible subset of examples from the original data by using rules and/or heuristics. The purely random selection of examples is usually known as sampling, and it is present in a very large number of DM models for conducting internal validation and for avoiding over-fitting; a small sketch of this case follows.
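The sketch below performs stratified random sampling (the sampling case just mentioned, not an intelligent selection heuristic), keeping an arbitrarily chosen 20% of the examples while preserving the class proportions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # stand-in data set
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)   # keep 20% of examples

print("original examples:", X.shape[0])      # 150
print("selected examples:", X_sub.shape[0])  # 30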
1.6.2.3 Discretization [15]
This procedure transforms quantitative data into qualitative data, that is, numerical attributes into discrete or nominal attributes with a finite number of intervals, obtaining a non-overlapping partition of a continuous domain. Each interval is then associated with a discrete numerical value. Once the discretization is performed, the data can be treated as nominal data during any DM process.
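As an illustration, the following is a minimal equal-width discretization sketch (one of the simplest possible schemes, chosen here only for clarity); the choice of 4 intervals is an arbitrary assumption, not a recommendation.

import numpy as np

def equal_width_discretize(values, n_intervals=4):
    # Build n_intervals non-overlapping intervals of equal width over the
    # observed range and return the interval index for each value.
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    return np.digitize(values, edges[1:-1])   # interior edges delimit the bins

ages = np.array([23.0, 31.5, 37.0, 45.2, 52.8, 61.0, 70.3])
print(equal_width_discretize(ages))           # [0 0 1 1 2 3 3]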
It is noteworthy that discretization is actually a hybrid data preprocessing technique involving both data preparation and data reduction tasks. Some sources include discretization in the data transformation category, while other sources consider it a data reduction process. In practice, discretization can be viewed as a data reduction method, since it maps data from a huge spectrum of numeric values to a greatly reduced set of discrete values. Our decision is to include it mostly in data reduction, although we also acknowledge the other view. The motivation behind this is that recent discretization schemes try to reduce the number of discrete intervals as much as possible while maintaining the performance of the subsequent DM process. In other words, it is often very easy to perform a basic discretization for any type of data, provided the data is suitable for a given algorithm, with a simple mapping between continuous and categorical values. The real difficulty, however, is to achieve a good reduction without compromising the quality of the data, and much of the effort expended by researchers follows this direction.