Data Reduction - Data Preprocessing in Data Mining


• To reduce the data by removing the weaker components, those with low variance. A reliable reconstruction of the data may then be possible using only the strongest principal components.

The final output of PCA is a new set of attributes representing the original data set. The user would use only the first few of these new variables because they contain most of the information represented in the original data. PCA can be applied to any type of data. It is also used as a data visualization tool by reducing any multidimensional data into two- or three-dimensional data.
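The reduction described above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: it assumes numeric data, uses NumPy's singular value decomposition on the centered data, and the names `pca_reduce`, `pca_reconstruct`, and the toy data set are hypothetical.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its strongest principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    variances = S ** 2 / (X.shape[0] - 1)  # variance captured by each component
    ratio = variances[:n_components] / variances.sum()
    scores = Xc @ components.T  # the new, reduced attributes
    return scores, components, mean, ratio


def pca_reconstruct(scores, components, mean):
    """Approximate the original data from the retained components only."""
    return scores @ components + mean


rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): 3 attributes driven by
# 2 underlying directions plus a small amount of noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.5, -1.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

scores, components, mean, ratio = pca_reduce(X, n_components=2)
X_hat = pca_reconstruct(scores, components, mean)
reconstruction_error = np.mean((X - X_hat) ** 2)

print("reduced shape:", scores.shape)
print("share of variance kept:", ratio.sum())
print("mean squared reconstruction error:", reconstruction_error)
```

Because the discarded third component carries almost no variance in this toy data, the two retained components reconstruct the original three attributes closely. With `n_components=2` the `scores` array can also be plotted directly for visualization.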

6.2.2 Factor Analysis

Factor analysis is similar to PCA in the sense that it leads to the deduction of a new, smaller set of variables that practically describe the behaviour given in the original data. Nevertheless, factor analysis is different because it does not seek to find transformations for the given attributes. Instead, its goal is to discover hidden factors in the current variables [17]. Although factor analysis has an important role as a process of data exploration, we limit its description to a data reduction method.

In factor analysis, it is assumed that there is a set of unobservable latent factors $z_j$, $j = 1, \ldots, k$, which, when acting together, generate the original data. Here, the objective is to characterize the dependency among the variables by means of a smaller number of factors.

The basic idea behind factor analysis is to attempt to find a set of hidden factors so that the current attributes can be recovered by performing a set of linear transformations over these factors. Given the set of attributes $a_1, a_2, \ldots, a_m$, factor analysis attempts to find the set of factors $f_1, f_2, \ldots, f_k$, so that

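\[
\begin{aligned}
a_1 - \mu_1 &= l_{11} f_1 + l_{12} f_2 + \cdots + l_{1k} f_k + \varepsilon_1 \\
a_2 - \mu_2 &= l_{21} f_1 + l_{22} f_2 + \cdots + l_{2k} f_k + \varepsilon_2 \\
&\;\;\vdots \\
a_m - \mu_m &= l_{m1} f_1 + l_{m2} f_2 + \cdots + l_{mk} f_k + \varepsilon_m
\end{aligned}
\]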

where $\mu_1, \mu_2, \ldots, \mu_m$ are the means of the attributes $a_1, a_2, \ldots, a_m$, and the terms $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m$ represent the unobservable part of the attributes, also called specific factors. The terms $l_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, k$, are known as the loadings. The factors $f_1, f_2, \ldots, f_k$ are known as the common factors.

The previous equation can be written in matrix form as:

\[
A - \mu = L F + \varepsilon
\]

Thus, the factor analysis problem can be stated as follows: given the attributes $A$, along with the mean $\mu$, we endeavor to find the set of factors $F$ and the associated loadings $L$ such that the above equation holds.
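To make the model concrete, the following sketch generates data from the factor equation rather than fitting it. All numeric values are hypothetical. It also relies on assumptions that are standard for factor analysis but not stated in the text above: the common factors are uncorrelated with unit variance, and the specific factors are independent of them with a diagonal covariance, called `psi` here. Under those assumptions the covariance of the attributes equals $L L^{T} + \Psi$, which the sketch checks numerically.

```python
import numpy as np

rng = np.random.default_rng(42)
m, k, n = 4, 2, 50_000  # attributes, common factors, samples

# Hypothetical model parameters, chosen only for illustration.
mu = np.array([10.0, -3.0, 0.5, 7.0])      # attribute means
L = np.array([[0.9, 0.0],
              [0.7, 0.3],
              [0.0, 0.8],
              [0.2, 0.6]])                  # loadings l_ij (m x k)
psi = np.array([0.10, 0.20, 0.15, 0.05])    # variances of the specific factors

# Assumed distributions: standard normal common factors and
# independent Gaussian specific factors.
F = rng.normal(size=(n, k))
eps = rng.normal(size=(n, m)) * np.sqrt(psi)

# Each row is one observation: a = mu + L f + eps.
A = mu + F @ L.T + eps

sample_cov = np.cov(A, rowvar=False)
model_cov = L @ L.T + np.diag(psi)
max_gap = np.abs(sample_cov - model_cov).max()

print("largest covariance gap:", max_gap)
```

The small gap between the sample covariance and `L @ L.T + np.diag(psi)` shows how a few common factors account for the dependency among many attributes. Fitting procedures work in the opposite direction, estimating the loadings and specific variances from the observed covariance.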
