Data Reduction - Data Preprocessing in Data Mining

Chapter 6

Data Reduction

Abstract The most common data reduction tasks in Data Mining consist of removing or grouping the data along its two main dimensions, examples and attributes, and of simplifying the domain of the data. A global overview of these tasks is given in Sect. 6.1. One of the well-known problems in Data Mining is the “curse of dimensionality”, which stems from the large number of attributes usually present in data. Section 6.2 deals with this problem. Data sampling and data simplification are introduced in Sects. 6.3 and 6.4, respectively, providing the basic notions on these topics for further analysis and explanation in subsequent chapters of the book.

6.1 Overview

Currently, it is not difficult to imagine having at our disposal a data warehouse for analysis that contains millions of samples, thousands of attributes and complex domains. Such data sets will likely be huge, so data analysis and mining would take a long time to produce a response, making the analysis impractical or even infeasible.

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet tries to keep most of the integrity of the original data [11]. The goal is to provide the mining process with a mechanism to produce the same (or almost the same) outcome when it is applied to the reduced data instead of the original data, while making the mining more efficient.
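
As a rough illustration of this goal, the following minimal sketch reduces a synthetic data set by simple random sampling and compares one summary statistic (the mean) computed on the original and on the reduced data. The data, the 1% sampling fraction, the seeds and the helper name `reduce_by_sampling` are assumptions made for the example; they are not taken from the chapter, and a real mining task would compare model outcomes rather than a single mean.

```python
import random
import statistics


def reduce_by_sampling(rows, fraction, seed=0):
    """Return a simple random sample (without replacement) of the rows.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Synthetic "original" data (assumption): 100,000 normally distributed values.
gen = random.Random(42)
original = [gen.gauss(50.0, 10.0) for _ in range(100_000)]

# Reduced representation: keep roughly 1% of the examples.
reduced = reduce_by_sampling(original, fraction=0.01)

mean_original = statistics.fmean(original)
mean_reduced = statistics.fmean(reduced)

print("examples:", len(original), "->", len(reduced))
print("mean on original data:", round(mean_original, 2))
print("mean on reduced data: ", round(mean_reduced, 2))
```

With a sample of this size the two means would be expected to lie close together, while any computation on the reduced data touches a hundred times fewer examples.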

In this section, we first present an overview of data reduction procedures. A closer

look at each individual technique will be provided throughout this chapter.

Basic data reduction techniques are usually categorized into three main families: dimensionality reduction (DR), sample numerosity reduction and cardinality reduction.

DR reduces the number of attributes or random variables in the data set. DR methods include feature selection (FS) and feature extraction/construction (Sect. 6.2 and Chap. 7 of this book), in which irrelevant dimensions are detected and removed or combined. The transformation or projection of the original data onto a smaller space can be done by PCA (Sect. 6.2.1), factor analysis (Sect. 6.2.2), MDS (Sect. 6.2.3) and LLE (Sect. 6.2.4), which are the most relevant techniques proposed in this field.
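
To make the idea of projecting data onto a smaller space concrete, the sketch below applies PCA to the simplest possible case: two attributes reduced to one. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, so it only works for two-dimensional input and assumes the data have non-zero variance. The function name `pca_2d_to_1d` and the toy data are hypothetical; this is not the implementation discussed in Sect. 6.2.1, and practical work would normally rely on a numerical library.

```python
import math
import statistics


def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component.

    Returns the unit direction, the 1-D scores and the fraction of
    variance explained. Illustrative helper, 2-D input only.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)

    # Sample covariance matrix [[a, b], [b, c]] of the centred data.
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix (closed form).
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

    # Matching eigenvector; fall back to a coordinate axis when b == 0.
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # One new attribute replaces the two original ones.
    scores = [(x - mx) * vx + (y - my) * vy for x, y in points]
    explained = lam / (a + c)
    return (vx, vy), scores, explained


# Toy data (assumption): the second attribute is exactly twice the first,
# so a single component should capture all of the variance.
points = [(float(x), 2.0 * x) for x in range(5)]
direction, scores, explained = pca_2d_to_1d(points)
print("direction:", direction)
print("explained variance ratio:", explained)
```

Because the two attributes in the toy data are perfectly correlated, the single projected attribute loses no information; with real data the explained-variance ratio indicates how much is lost by dropping the second component.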
