The $l$-th factor score for an expression vector $y$ is given by
\[
x_l = (w_l / \lambda_l)^T y .
\]
Now we assume the existence of MVs. In PC regression, the missing part $y^{\mathrm{miss}}$ in the expression vector $y$ is estimated from the observed part $y^{\mathrm{obs}}$ by using the PCA result. Let $w_l^{\mathrm{obs}}$ and $w_l^{\mathrm{miss}}$ be the parts of each principal axis $w_l$ corresponding to the observed and missing parts, respectively, in $y$. Similarly, let $W = (W^{\mathrm{obs}}, W^{\mathrm{miss}})$, where $W^{\mathrm{obs}}$ or $W^{\mathrm{miss}}$ denotes the matrix whose column vectors are $w_1^{\mathrm{obs}}, \ldots, w_K^{\mathrm{obs}}$ or $w_1^{\mathrm{miss}}, \ldots, w_K^{\mathrm{miss}}$, respectively.
Factor scores $x = (x_1, \ldots, x_K)$ for the expression vector $y$ are obtained by minimization of the residual error
\[
\mathrm{err} = \left\| y^{\mathrm{obs}} - W^{\mathrm{obs}} x \right\|^2 .
\]
This is a well-known regression problem, and the least squares solution is given by
\[
x = \left( W^{\mathrm{obs}\,T} W^{\mathrm{obs}} \right)^{-1} W^{\mathrm{obs}\,T} y^{\mathrm{obs}} .
\]
Using $x$, the missing part is estimated as
\[
y^{\mathrm{miss}} = W^{\mathrm{miss}} x . \tag{4.23}
\]
In the PC regression above, $W$ should be known beforehand. Later, we will discuss how to determine this parameter.
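As an illustration, the following is a minimal sketch of this PC regression step in Python with NumPy. The function name `pc_regression_impute`, the NaN encoding of missing entries, and the construction of the principal axes from the sample covariance (scaled as $w_l = \sqrt{\lambda_l}\, u_l$, which is consistent with the factor score formula above) are assumptions made for this example; the text itself takes $W$ as given.

```python
import numpy as np

def pc_regression_impute(Y, y, K):
    """Fill the NaN (missing) entries of y by PC regression.

    Y : (N, D) complete, mean-centered data matrix used to fit the PCA
    y : (D,) expression vector; NaN marks the missing part y^miss
    K : number of principal axes retained

    Hypothetical helper, for illustration only.
    """
    obs = ~np.isnan(y)                      # observed / missing index split
    S = np.cov(Y, rowvar=False)             # sample covariance matrix
    lam, U = np.linalg.eigh(S)              # eigenvalues in ascending order
    top = np.argsort(lam)[::-1][:K]         # K leading eigenpairs
    W = U[:, top] * np.sqrt(np.clip(lam[top], 0.0, None))  # w_l = sqrt(lam_l) u_l
    W_obs, W_miss = W[obs], W[~obs]         # row split matching y^obs / y^miss
    # Least squares factor scores: x = (W_obs^T W_obs)^{-1} W_obs^T y_obs
    x, *_ = np.linalg.lstsq(W_obs, y[obs], rcond=None)
    y_filled = y.copy()
    y_filled[~obs] = W_miss @ x             # y^miss = W^miss x   (Eq. 4.23)
    return y_filled
```

Here `np.linalg.lstsq` computes the same least squares solution $(W^{\mathrm{obs}\,T} W^{\mathrm{obs}})^{-1} W^{\mathrm{obs}\,T} y^{\mathrm{obs}}$ as the formula above, but in a numerically stabler way than forming the inverse explicitly.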
4.4.3.2 Bayesian Estimation
A parametric probabilistic model, called probabilistic PCA (PPCA), has been proposed recently. The probabilistic model is based on the assumption that the residual error $\epsilon$ and the factor scores $x_l$ $(1 \le l \le K)$ in the linear combination model above obey normal distributions:
\[
p(x) = \mathcal{N}_K(x \mid 0, I_K), \qquad
p(\epsilon) = \mathcal{N}_D(\epsilon \mid 0, (1/\tau) I_D),
\]
where $\mathcal{N}_K(x \mid \mu, \Sigma)$ denotes a $K$-dimensional normal distribution for $x$ whose mean and covariance are $\mu$ and $\Sigma$, respectively. $I_K$ is the $(K \times K)$ identity matrix and $\tau$ is the scalar inverse variance of $\epsilon$. In this PPCA model, the complete log-likelihood function is written as:
\[
\ln p(y, x \mid \theta) \equiv \ln p(y, x \mid W, \mu, \tau)
= -\frac{\tau}{2} \left\| y - Wx - \mu \right\|^2 - \frac{1}{2} \left\| x \right\|^2 + \frac{D}{2} \ln \tau - \frac{K+D}{2} \ln 2\pi .
\]
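To make the likelihood concrete, here is a short sketch that evaluates this complete log-likelihood and checks it on data drawn from the model's own assumptions. The function name `ppca_complete_loglik` and the randomly generated test parameters are hypothetical, introduced only for this example.

```python
import numpy as np

def ppca_complete_loglik(y, x, W, mu, tau):
    """Complete log-likelihood ln p(y, x | W, mu, tau) of the PPCA model."""
    D, K = W.shape
    resid = y - W @ x - mu
    return (-0.5 * tau * resid @ resid     # -(tau/2) ||y - Wx - mu||^2
            - 0.5 * x @ x                  # -(1/2) ||x||^2
            + 0.5 * D * np.log(tau)        # +(D/2) ln tau
            - 0.5 * (K + D) * np.log(2.0 * np.pi))

# Sanity check: generate (x, y) exactly as the model assumes
rng = np.random.default_rng(0)
D, K, tau = 6, 2, 4.0
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
x = rng.normal(size=K)                                # x ~ N_K(0, I_K)
eps = rng.normal(scale=1.0 / np.sqrt(tau), size=D)    # eps ~ N_D(0, (1/tau) I_D)
y = W @ x + mu + eps
print(ppca_complete_loglik(y, x, W, mu, tau))
```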