not talked about any imputation yet. The reason is that EM is a meta-algorithm that is adapted to a particular application.
To use EM for imputation, we first need to choose a plausible set of parameters; that is, we need to assume that the data follow a probability distribution, which is usually seen as a drawback of this kind of method. The EM algorithm works better with probability distributions that are easy to maximize, such as Gaussian mixture models. In [85] an EM approach using a multivariate Gaussian is proposed, since multivariate Gaussian data can be parameterized by the mean and the covariance matrix.
In each iteration of the EM algorithm for imputation, the estimates of the mean $\mu$ and the covariance matrix $\Sigma$ are revised in three phases. These parameters are used to apply a regression over the MVs by using the complete data. In the first phase, for each instance with MVs the regression parameters $B$ for the MVs are calculated from the current estimates of the mean and covariance matrix and the available complete data. Next the MVs are imputed with their conditional expectation values from the available complete ones and the estimated regression coefficients:
$$\hat{x}_{mis} = \hat{\mu}_{mis} + (x_{obs} - \hat{\mu}_{obs})\,\hat{B} + e, \qquad (4.9)$$
where the instance $x$ of $n$ attributes is separated into the observed values $x_{obs}$ and the missing ones $x_{mis}$. The mean and covariance matrix are also separated in such a way. The residual $e \in \mathbb{R}^{1 \times n_{mis}}$ is assumed to be a random vector with mean zero and unknown covariance matrix. These two phases complete the E-step. Please note that for the iteration of the algorithm the imputation itself is not strictly needed, since only the estimates of the mean and covariance matrix, as well as the regression parameters, are. But our ultimate goal is to have our data set filled, so we use the latest regression parameters to create the best imputed values so far.
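To make the E-step concrete, below is a minimal NumPy sketch, under the assumption of a single multivariate Gaussian, of the regression imputation of Eq. (4.9) for one instance. The function name impute_instance, the missing-value mask convention and the toy estimates are illustrative choices, not part of [85]; the zero-mean residual $e$ is dropped, so the conditional expectation itself is imputed.

import numpy as np

def impute_instance(x, mu, Sigma):
    # Boolean masks of missing and observed attributes for this instance.
    mis = np.isnan(x)
    obs = ~mis
    # Partition the current estimates of the mean and covariance matrix.
    mu_obs, mu_mis = mu[obs], mu[mis]
    S_oo = Sigma[np.ix_(obs, obs)]
    S_om = Sigma[np.ix_(obs, mis)]
    # Regression coefficients B = Sigma_oo^{-1} Sigma_om (made explicit in
    # Eq. (4.10) below); a linear solve is used instead of an explicit inverse.
    B = np.linalg.solve(S_oo, S_om)
    # Conditional expectation of the missing values, Eq. (4.9) with e = 0.
    x_imp = x.copy()
    x_imp[mis] = mu_mis + (x[obs] - mu_obs) @ B
    return x_imp

# Toy usage with hypothetical current estimates mu and Sigma.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
x = np.array([0.4, np.nan, 2.5])
print(impute_instance(x, mu, Sigma))

In the full algorithm, the covariance of the imputation error, Eq. (4.11) below, would also be recorded for use in the M-step.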
In the third phase the M-step is completed by re-estimating the mean and covariance matrix. The mean is taken as the sample mean of the completed data set, and the covariance is obtained from the sample covariance matrix of the completed data together with the covariance matrices of the imputation errors, as shown in [54]. That is:
$$\hat{B} = \hat{\Sigma}_{obs,obs}^{-1}\,\hat{\Sigma}_{obs,mis}, \qquad (4.10)$$
and
$$\hat{C} = \hat{\Sigma}_{mis,mis} - \hat{\Sigma}_{mis,obs}\,\hat{\Sigma}_{obs,obs}^{-1}\,\hat{\Sigma}_{obs,mis}. \qquad (4.11)$$
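As a hedged illustration of Eqs. (4.10) and (4.11), the snippet below derives $\hat{B}$ and the residual covariance $\hat{C}$ from the partitioned covariance estimate, given the index sets of observed and missing attributes of an instance. The helper name regression_params is an assumption made for this example, and again a linear solve replaces the explicit inverse.

import numpy as np

def regression_params(Sigma, obs, mis):
    # Partition the current covariance estimate over the observed and
    # missing attributes of one instance (obs and mis are index arrays).
    S_oo = Sigma[np.ix_(obs, obs)]
    S_om = Sigma[np.ix_(obs, mis)]
    S_mo = Sigma[np.ix_(mis, obs)]
    S_mm = Sigma[np.ix_(mis, mis)]
    B = np.linalg.solve(S_oo, S_om)   # Eq. (4.10): Sigma_oo^{-1} Sigma_om
    C = S_mm - S_mo @ B               # Eq. (4.11): Schur complement of Sigma_oo
    return B, C

Here $\hat{C}$ is the conditional covariance of the missing attributes given the observed ones, which is the part the M-step adds back so that the re-estimated covariance does not shrink towards the noise-free imputed values.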
The hat accent $\hat{A}$ designates an estimate of a quantity $A$. After updating $\hat{B}$ and $\hat{C}$, the mean and covariance matrix must be updated with
$$\mu^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{X}_i \qquad (4.12)$$
and