Equivalently, a matrix multiplication can be applied to the input data in order to
obtain the wavelet coefficients, where the matrix used depends on the given DWT. The
matrix must be orthonormal, meaning that the columns are unit vectors and are mutually
orthogonal, so that the matrix inverse is just its transpose. Although we do not have
room to discuss it here, this property allows the reconstruction of the data from the
smooth and smooth-difference data sets. By factoring the matrix used into a product of
a few sparse matrices, the resulting “fast DWT” algorithm has a complexity of O(n) for
an input vector of length n.
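To make this concrete, the sketch below applies an orthonormal matrix for a single level of the Haar DWT to a length-4 vector; the NumPy code, the matrix name H, and the sample values are illustrative choices, not part of any particular DWT implementation.

```python
import numpy as np

# Orthonormal matrix for a single level of the length-4 Haar DWT
# (illustrative; the exact matrix depends on the chosen DWT).
s = 1.0 / np.sqrt(2.0)
H = np.array([[s,  s,  0,  0],   # "smooth" (averaging) rows
              [0,  0,  s,  s],
              [s, -s,  0,  0],   # "smooth-difference" (detail) rows
              [0,  0,  s, -s]])

x = np.array([2.0, 4.0, 6.0, 8.0])   # hypothetical input vector
coeffs = H @ x                        # wavelet coefficients

# Because H is orthonormal, its inverse is simply its transpose, so the
# original data can be reconstructed exactly from the coefficients.
assert np.allclose(H.T @ coeffs, x)
```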
Wavelet transforms can be applied to multidimensional data such as a data cube. This
is done by first applying the transform to the first dimension, then to the second, and so
on. The computational complexity involved is linear with respect to the number of cells
in the cube. Wavelet transforms give good results on sparse or skewed data and on data
with ordered attributes. Lossy compression by wavelets is reportedly better than JPEG
compression, the current commercial standard. Wavelet transforms have many real-
world applications, including the compression of fingerprint images, computer vision,
analysis of time-series data, and data cleaning.
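The dimension-by-dimension procedure can be sketched on a small two-dimensional array standing in for one slice of a data cube; the code below repeats the Haar matrix H from the previous sketch so that it stands alone, and the array contents are hypothetical.

```python
import numpy as np

s = 1.0 / np.sqrt(2.0)
H = np.array([[s, s, 0, 0], [0, 0, s, s], [s, -s, 0, 0], [0, 0, s, -s]])

cube = np.arange(16.0).reshape(4, 4)   # hypothetical 4x4 slice of a data cube

# Apply the 1-D transform to the first dimension, then to the second.
step1 = np.apply_along_axis(lambda v: H @ v, 0, cube)
step2 = np.apply_along_axis(lambda v: H @ v, 1, step1)

# Each cell is touched a constant number of times per dimension, so the
# cost is linear in the number of cells in the cube.
back = np.apply_along_axis(lambda v: H.T @ v, 1, step2)
back = np.apply_along_axis(lambda v: H.T @ v, 0, back)
assert np.allclose(back, cube)
```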
3.4.3 Principal Components Analysis
In this subsection we provide an intuitive introduction to principal components
analysis as a method of dimensionality reduction. A detailed theoretical explanation is
beyond the scope of this topic. For additional references, please see the bibliographic
notes (Section 3.8) at the end of this chapter.
Suppose that the data to be reduced consist of tuples or data vectors described
by n attributes or dimensions. Principal components analysis (PCA; also called the
Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that
can best be used to represent the data, where k ≤ n. The original data are thus projected
onto a much smaller space, resulting in dimensionality reduction. Unlike attribute subset
selection (Section 3.4.4), which reduces the attribute set size by retaining a subset of
the initial set of attributes, PCA “combines” the essence of attributes by creating an
alternative, smaller set of variables. The initial data can then be projected onto this
smaller set. PCA often reveals relationships that were not previously suspected and
thereby allows interpretations that would not ordinarily result.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input
data. These are unit vectors that each point in a direction perpendicular to the others.
These vectors are referred to as the principal components . The input data are a linear
combination of the principal components.
3. The principal components are sorted in order of decreasing “significance” or
strength. The principal components essentially serve as a new set of axes for the data,
providing important information about variance.
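A minimal sketch of these steps, assuming standardization of each attribute followed by an eigendecomposition of the covariance matrix (the sample data, the choice of k, and the NumPy routines are illustrative assumptions), is shown below.

```python
import numpy as np

# Hypothetical data: 6 tuples described by n = 3 attributes.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# Step 1: normalize so that every attribute falls within a comparable range.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: the eigenvectors of the covariance matrix are orthonormal vectors
# that form a basis for the normalized data (the principal components).
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by decreasing "significance" (eigenvalue = variance
# captured along that axis) and keep the k strongest ones.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2                                # number of components to retain
projected = Z @ eigvecs[:, :k]       # data expressed in the reduced space
print(projected.shape)               # (6, 2)
```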