Chapter 6
Data Reduction
Abstract The most common data reduction tasks in Data Mining consist of removing
or grouping data along its two main dimensions, examples and attributes, and of
simplifying the domain of the data. A global overview of these tasks is given in
Sect. 6.1. One of the well-known problems in Data Mining is the “curse of
dimensionality”, which stems from the usually large number of attributes in
data. Section 6.2 deals with this problem. Data sampling and data simplification are
introduced in Sects. 6.3 and 6.4, respectively, providing the basic notions on these
topics for further analysis and explanation in subsequent chapters.
6.1 Overview
Nowadays, it is not difficult to imagine having at our disposal a data warehouse for
analysis that contains millions of samples, thousands of attributes and complex domains.
Such data sets are likely to be huge, so data analysis and mining would take a long
time to produce a response, making the analysis impractical or even infeasible.
Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume yet tries to keep most of the integrity of
the original data [11]. The goal is to provide the mining process with a mechanism
that produces the same (or almost the same) outcome when applied to the reduced
data instead of the original data, while making mining more efficient.
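The goal stated above can be illustrated with a minimal sketch. It assumes a hypothetical single numeric attribute drawn from a normal distribution and uses simple random sampling as the reduction step; the helper name reduce_by_sampling and all parameter values are illustrative choices, not taken from the chapter. The "mining" step is stood in for by a mean estimate, which should change only slightly when computed on the reduced data.

```python
import random
import statistics


def reduce_by_sampling(rows, fraction, seed=0):
    """Return a simple random sample (without replacement) of the rows."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical data: 100,000 values of one numeric attribute (assumption).
data_rng = random.Random(42)
full = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Keep 1% of the examples.
reduced = reduce_by_sampling(full, fraction=0.01)

# Stand-in for a mining task: estimate the attribute mean on both sets.
full_mean = statistics.fmean(full)
reduced_mean = statistics.fmean(reduced)

print("examples:", len(full), "->", len(reduced))
print("absolute difference in estimated mean:", abs(full_mean - reduced_mean))
```

The estimate computed on the sample is much cheaper to obtain and stays close to the one computed on the full data, which is the trade-off data reduction aims for. Real mining tasks are more sensitive to what is discarded than a mean is, so this only conveys the idea.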
In this section, we first present an overview of data reduction procedures. A closer
look at each individual technique will be provided throughout this chapter.
Basic data reduction techniques are usually categorized into three main families:
dimensionality reduction (DR), sample numerosity reduction and cardinality reduction.
DR reduces the number of attributes or random variables in the data set. DR
methods include feature selection (FS) and feature extraction/construction (Sect. 6.2
and Chap. 7), in which irrelevant dimensions are detected, removed or combined.
The original data can be transformed or projected onto a smaller space by
principal component analysis (PCA, Sect. 6.2.1), factor analysis (Sect. 6.2.2),
multidimensional scaling (MDS, Sect. 6.2.3) and locally linear embedding
(LLE, Sect. 6.2.4), which are the most relevant techniques proposed in this field.
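As a rough illustration of projecting data onto a smaller space, the sketch below reduces two correlated attributes to a single one using the first principal component. It is a simplified stand-in, not the procedure of Sect. 6.2.1: the component is found with plain power iteration on the covariance matrix instead of a full eigendecomposition, and the data set, function names and parameters are all hypothetical.

```python
import math
import random


def center(data):
    """Subtract the per-attribute mean from every example."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iters=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, v):
    """Project each centered example onto direction v (one value each)."""
    return [sum(a * b for a, b in zip(row, v)) for row in centered]


# Hypothetical data: the second attribute is roughly twice the first.
rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
data = [[x, 2.0 * x + rng.gauss(0.0, 0.1)] for x in xs]

centered = center(data)
cov = covariance(centered)
v = first_component(cov)
scores = project(centered, v)  # reduced, one-attribute representation

# Share of the total variance retained by the single new attribute.
total_var = cov[0][0] + cov[1][1]
kept_var = sum(s * s for s in scores) / (len(scores) - 1)
explained = kept_var / total_var
print("direction:", v)
print("fraction of variance kept:", explained)
```

Because the two attributes are almost perfectly linearly related, one projected attribute retains nearly all of the variance. With less redundant attributes more components would be needed, and a library routine would normally be used rather than hand-written power iteration.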
 