The basic idea is to find a set of linear transformations of the original variables that describes most of the variance using a relatively small number of variables. Hence, PCA searches for k n-dimensional orthogonal vectors that can best represent the data, where k ≤ n. The new set of attributes is derived in decreasing order of contribution, so that the first variable obtained, called the first principal component, contains the largest proportion of the variance of the original data set. Unlike feature selection (FS), PCA combines the essence of the original attributes to form a new, smaller subset of attributes.
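Stated more formally (the notation here is ours, not the original text's), the first principal component is the unit vector w_1 that maximizes the variance of the projection of the centered data matrix X onto it:

    w_1 = argmax_{||w|| = 1} Var(X w) = argmax_{||w|| = 1} w^T Σ w,

where Σ is the covariance matrix of X; each subsequent component w_i solves the same problem subject to being orthogonal to w_1, ..., w_{i-1}.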
The usual procedure is to keep only the first few principal components, which may contain 95% or more of the variance of the original data set. PCA is particularly useful when there are many independent variables and they show high correlation.
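As a concrete illustration of this rule of thumb, scikit-learn's PCA accepts a fractional n_components and keeps the smallest number of components whose cumulative explained variance reaches that fraction. The synthetic data set below is ours and purely illustrative:

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic data: 10 highly correlated attributes driven by 3 latent factors.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 3))
    X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

    # A fractional n_components keeps the smallest number of components
    # whose cumulative explained variance reaches that fraction.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                      # (200, k) with k << 10 here
    print(pca.explained_variance_ratio_.sum())  # at least 0.95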
The basic procedure is as follows:
1. Normalize the input data, equalizing the ranges among attributes.
2. Compute k orthonormal vectors that provide a basis for the normalized input data. Each of these vectors points in a direction perpendicular to the others; they are called principal components. The original data can be expressed as a linear combination of the principal components. To compute them, the eigenvalues and eigenvectors of the covariance matrix of the sample data are needed.
3. Sort the principal components according to their strength, given by their associated eigenvalues. The principal components serve as a new set of axes for the data, oriented according to the variance of the original data. In Fig. 6.1, we show an illustrative example of the first two principal components for a given data set; a from-scratch sketch of the three steps above follows the figure.
Fig. 6.1 PCA. X and Y are the first two principal components obtained
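As a sketch of the three-step procedure above, assuming a numeric data set (the synthetic data and variable names below are ours), the eigen-decomposition route can be implemented with numpy alone:

    import numpy as np

    # Synthetic, purely illustrative data with one strongly correlated attribute.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    X[:, 2] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

    # Step 1: normalize (zero mean, unit variance per attribute).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: eigenvalues/eigenvectors of the sample covariance matrix;
    # eigh is appropriate because the covariance matrix is symmetric.
    cov = np.cov(Z, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 3: sort components by decreasing eigenvalue (strength).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Project the data onto the new axes (the principal components).
    scores = Z @ eigenvectors

    # Proportion of variance carried by each component, in decreasing order.
    print(eigenvalues / eigenvalues.sum())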