Sample numerosity reduction methods replace the original data by an alternative, smaller data representation. They can be either parametric or non-parametric methods. The former require the estimation of a model that fits the original data, using parameters to represent the data instead of the actual data. They are closely related to DM techniques (regression and log-linear models are common parametric data reduction techniques) and we consider their explanation to be out of the scope of this topic. Non-parametric methods, in contrast, work directly with the data itself and return other data representations with similar structures. They include data sampling (Sect. 6.3), different forms of data grouping, such as data condensation, data squashing and data clustering (Sects. 6.3.1, 6.3.2 and 6.3.3, respectively), and IS as a more intelligent form of sample reduction (Chap. 8 of this topic).
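As a rough illustration of the non-parametric route, the following sketch (in Python, with NumPy and scikit-learn as assumed tools; the synthetic dataset and the choice of 100 representatives are arbitrary, not values prescribed by the text) reduces a set of 10,000 points either by random sampling or by replacing the points with cluster centroids:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(10000, 5))              # original data: 10,000 points, 5 features

# Numerosity reduction by sampling: keep a random subset of the points
X_sample = X[rng.choice(len(X), size=100, replace=False)]

# Numerosity reduction by clustering: keep only the cluster centroids
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
X_centroids = km.cluster_centers_            # 100 representatives stand in for X

print(X.shape, X_sample.shape, X_centroids.shape)

Either reduced set can then be fed to a learner in place of the full data, trading some fidelity for a much smaller representation.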
Cardinality reduction comprises the transformations applied to obtain a reduced representation of the original data. As we have mentioned at the beginning of this topic, there may be a high level of overlapping between data reduction techniques and data preparation techniques, and this category is a representative example with respect to data transformations. Within data reduction, we include the binning process (Sect. 6.4) and the more general discretization approaches (Chap. 9 of this topic).
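To make the binning idea concrete, the minimal sketch below (in Python with NumPy; the toy values and the choice of four equal-width bins are assumptions for illustration only) discretizes a numeric attribute and then smooths it by bin means:

import numpy as np

values = np.array([2.1, 3.5, 7.8, 1.0, 9.9, 5.5, 6.2, 4.4])

n_bins = 4
edges = np.linspace(values.min(), values.max(), n_bins + 1)
# Assign each value to a bin index in 0..n_bins-1 using the interior edges
bins = np.digitize(values, edges[1:-1])
# Smoothing by bin means: replace every value with the mean of its bin
smoothed = np.array([values[bins == b].mean() for b in bins])

print(bins)       # [0 1 3 0 3 2 2 1]
print(smoothed)

The eight distinct values collapse to four bin labels (or four bin means), which is exactly the cardinality reduction this category refers to.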
In the next sections, we will define the main aspects of each of the aforementioned strategies.
6.2 The Curse of Dimensionality
A major problem in DM in large data sets with many potential predictor variables is the curse of dimensionality. Dimensionality becomes a serious obstacle for the efficiency of most DM algorithms, because of their computational complexity. The term was coined by Richard Bellman [4] to describe a problem that worsens as more variables are added to a model.
High dimensionality of the input increases the size of the search space in an exponential manner and also increases the chance of obtaining invalid models. It is well known that there is a linear relationship between the required number of training samples and the dimensionality for obtaining high quality models in DM [8]. But when considering non-parametric learners, such as instance-based methods or decision trees, the situation is even more severe. It has been estimated that, as the number of dimensions increases, the sample size needs to increase exponentially in order to obtain an effective estimate of multivariate densities [13].
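A back-of-the-envelope sketch (in Python; the 10 bins per dimension and 10 samples per cell are assumed figures for illustration, not values taken from [13]) shows how quickly this requirement grows:

bins_per_dim = 10        # resolution at which each dimension is partitioned
samples_per_cell = 10    # samples needed to estimate the density in one cell

for d in range(1, 6):
    cells = bins_per_dim ** d
    print(f"d={d}: {cells:,} cells, about {samples_per_cell * cells:,} samples needed")

Already at five dimensions the toy grid has 100,000 cells, so a fixed-size data set covers a vanishing fraction of the space.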
It is evident that the curse of dimensionality affects data differently depending on the DM task or algorithm involved. For example, techniques like decision trees may fail to provide meaningful and understandable results when the number of dimensions increases, although the speed of the learning stage is barely affected. In contrast, instance-based learning depends heavily on dimensionality, which directly degrades its efficiency.
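One way to see why instance-based learners suffer is the concentration of distances: as dimensionality grows, the nearest and farthest neighbours of a point become almost equidistant, so neighbourhood-based decisions lose discriminative power. The following sketch (in Python with NumPy; the uniform random data and chosen dimensions are assumptions for illustration) measures this effect:

import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for d in (2, 10, 100, 1000):
    data = rng.random((n_points, d))     # points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(data - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast = {contrast:.3f}")

The printed contrast shrinks steadily as d grows, illustrating why nearest-neighbour distinctions become unreliable in high dimensions.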
To alleviate this problem, a number of dimension reducers have been developed over the years. Among linear methods, we can refer to factor analysis [18] and
 