Sample numerosity reduction methods replace the original data by an alternative, smaller data representation. They can be either parametric or non-parametric methods. The former require the estimation of a model that fits the original data, using parameters to represent the data instead of the actual data. They are closely related to DM techniques (regression and log-linear models are common parametric data reduction techniques) and we consider their explanation to be out of the scope of this topic. Non-parametric methods, in contrast, work directly with the data itself and return other data representations with similar structures. They include data sampling (Sect. 6.3), different forms of data grouping, such as data condensation, data squashing and data clustering (Sects. 6.3.1, 6.3.2 and 6.3.3, respectively), and IS as a more intelligent form of sample reduction (Chap. 8 of this topic).
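As a rough illustration of the non-parametric route, the following sketch (in Python, with NumPy and scikit-learn as assumed tools; the synthetic dataset and the choice of 100 representatives are arbitrary, not values prescribed by the text) reduces a set of 10,000 points either by random sampling or by replacing the points with cluster centroids:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(10000, 5))              # original data: 10,000 points, 5 features

# Numerosity reduction by sampling: keep a random subset of the points
X_sample = X[rng.choice(len(X), size=100, replace=False)]

# Numerosity reduction by clustering: keep only the cluster centroids
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
X_centroids = km.cluster_centers_            # 100 representatives stand in for X

print(X.shape, X_sample.shape, X_centroids.shape)

Either reduced set can then be fed to a learner in place of the full data, trading some fidelity for a much smaller representation.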
Cardinality reduction comprises the transformations applied to obtain a reduced representation of the original data. As we have mentioned at the beginning of this topic, there may be a high level of overlapping between data reduction techniques and data preparation techniques, and this category is a representative example with respect to data transformations. Within data reduction, we include the binning process (Sect. 6.4) and the more general discretization approaches (Chap. 9 of this topic).
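To make the binning idea concrete, the minimal sketch below (in Python with NumPy; the toy values and the choice of four equal-width bins are assumptions for illustration only) discretizes a numeric attribute and then smooths it by bin means:

import numpy as np

values = np.array([2.1, 3.5, 7.8, 1.0, 9.9, 5.5, 6.2, 4.4])

n_bins = 4
edges = np.linspace(values.min(), values.max(), n_bins + 1)
# Assign each value to a bin index in 0..n_bins-1 using the interior edges
bins = np.digitize(values, edges[1:-1])
# Smoothing by bin means: replace every value with the mean of its bin
smoothed = np.array([values[bins == b].mean() for b in bins])

print(bins)       # [0 1 3 0 3 2 2 1]
print(smoothed)

The eight distinct values collapse to four bin labels (or four bin means), which is exactly the cardinality reduction this category refers to.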
In the next sections, we will define the main aspects of each of the aforementioned strategies.
6.2 The Curse of Dimensionality
A major problem in DM in large data sets with many potential predictor variables is the curse of dimensionality. Dimensionality becomes a serious obstacle for the efficiency of most DM algorithms, because of their computational complexity. The term was coined by Richard Bellman [4] to describe a problem that worsens as more variables are added to a model.
High dimensionality of the input increases the size of the search space in an exponential manner and also increases the chance of obtaining invalid models. It is well known that there is a linear relationship between the required number of training samples and the dimensionality for obtaining high quality models in DM [8]. But when considering non-parametric learners, such as instance-based methods or decision trees, the situation is even more severe. It has been estimated that, as the number of dimensions increases, the sample size needs to increase exponentially in order to obtain an effective estimate of multivariate densities [13].
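A back-of-the-envelope sketch (in Python; the 10 bins per dimension and 10 samples per cell are assumed figures for illustration, not values taken from [13]) shows how quickly this requirement grows:

bins_per_dim = 10        # resolution at which each dimension is partitioned
samples_per_cell = 10    # samples needed to estimate the density in one cell

for d in range(1, 6):
    cells = bins_per_dim ** d
    print(f"d={d}: {cells:,} cells, about {samples_per_cell * cells:,} samples needed")

Already at five dimensions the toy grid has 100,000 cells, so a fixed-size data set covers a vanishing fraction of the space.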
It is evident that the curse of dimensionality affects data differently depending on the DM task or algorithm involved. For example, techniques like decision trees may fail to provide meaningful and understandable results when the number of dimensions increases, although the speed of the learning stage is barely affected. In contrast, instance-based learning depends heavily on dimensionality, which directly degrades its efficiency.
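One way to see why instance-based learners suffer is the concentration of distances: as dimensionality grows, the nearest and farthest neighbours of a point become almost equidistant, so neighbourhood-based decisions lose discriminative power. The following sketch (in Python with NumPy; the uniform random data and chosen dimensions are assumptions for illustration) measures this effect:

import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for d in (2, 10, 100, 1000):
    data = rng.random((n_points, d))     # points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(data - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast = {contrast:.3f}")

The printed contrast shrinks steadily as d grows, illustrating why nearest-neighbour distinctions become unreliable in high dimensions.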
To alleviate this problem, a number of dimension reducers have been developed over the years. Among linear methods, we can refer to factor analysis [18] and
 