$$\tilde{\Sigma}^{(t+1)} = \frac{1}{\tilde{n}} \sum_{i=1}^{n} S_i^{(t+1)}, \qquad (4.13)$$

where, for each instance $x_i$, the conditional expectation $S_i^{(t+1)}$ of the cross-products is composed of three parts. The two parts that involve the available values in the instance are

$$E(x_{obs} x_{obs}^T \mid x_{obs}; \mu^{(t)}, \Sigma^{(t)}) = x_{obs} x_{obs}^T \qquad (4.14)$$

and the cross term $E(x_{obs} x_{mis}^T \mid x_{obs}; \mu^{(t)}, \Sigma^{(t)}) = x_{obs} \hat{x}_{mis}^T$, where $\hat{x}_{mis}$ denotes the imputed (conditionally expected) missing values. The third part,

$$E(x_{mis} x_{mis}^T \mid x_{obs}; \mu^{(t)}, \Sigma^{(t)}) = \hat{x}_{mis} \hat{x}_{mis}^T + C, \qquad (4.15)$$

is the sum of the cross-product of the imputed values and the residual covariance matrix $C = \mathrm{Cov}(x_{mis}, x_{mis} \mid x_{obs}; \mu^{(t)}, \Sigma^{(t)})$, the conditional covariance matrix of the imputation error. The normalization constant $\tilde{n}$ of the covariance matrix estimate [Eq. (4.13)] is the number of degrees of freedom of the sample covariance matrix of the completed data set.
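To make these conditional expectations concrete, the following minimal sketch computes $\hat{x}_{mis}$, $C$, and $S_i$ for a single instance under the multivariate normal model, using the standard conditional-Gaussian formulas. It assumes NumPy, NaN as the missing-value marker, and an illustrative function name (e_step_instance) that does not come from [85].

import numpy as np

def e_step_instance(x, mu, Sigma):
    # E-step for one instance x (1-D array, NaN = missing) under the
    # current estimates mu and Sigma of the multivariate normal model.
    obs = ~np.isnan(x)
    mis = ~obs
    # B: regression coefficients of the missing part on the observed part.
    B = Sigma[np.ix_(mis, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
    # Imputed values: conditional mean of x_mis given x_obs.
    x_hat = x.copy()
    x_hat[mis] = mu[mis] + B @ (x[obs] - mu[obs])
    # Residual covariance C = Cov(x_mis, x_mis | x_obs), as in Eq. (4.15).
    C = Sigma[np.ix_(mis, mis)] - B @ Sigma[np.ix_(obs, mis)]
    # Expected cross-products S_i: outer product of the completed
    # instance, with C added on the missing-missing block.
    S = np.outer(x_hat, x_hat)
    S[np.ix_(mis, mis)] += C
    return x_hat, S

When no values are missing in the instance, the sketch reduces to the plain cross-product of Eq. (4.14); when all are missing, it returns the current mean and covariance, as expected.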
The first estimation of the mean and covariance matrix needs to rely on a completely observed data set. One solution adopted in [85] is to fill in the missing values using initial estimates of the mean and covariance matrix. The process ends when the estimates of the mean and covariance matrix change by less than a predefined threshold between iterations, as in the sketch below. Please note that this EM approach is only well suited to numeric data sets, which constitutes a limitation for the application of EM, although an extension to mixed numerical and nominal attributes can be found in [82].
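As a rough illustration of the whole procedure, the sketch below wires the E-step above into the EM loop, with a mean-based initialization and the threshold-based stopping rule described in this section. It reuses e_step_instance from the previous sketch; em_impute, tol, and max_iter are illustrative choices, not the exact formulation of [85] (in particular, it uses the maximum likelihood normalization $\tilde{n} = n$ and subtracts the outer product of the updated mean, which matches Eq. (4.13) when $S_i$ is taken over deviations from the mean).

import numpy as np

def em_impute(X, tol=1e-6, max_iter=100):
    # EM imputation for a numeric data set X with NaN marking missing
    # entries; reuses e_step_instance from the sketch above.
    X = np.asarray(X, dtype=float)
    # Initialization: fill each column with its observed mean to obtain
    # a completely observed data set for the first estimates.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(np.isnan(X), col_means, X)
    mu = X_filled.mean(axis=0)
    Sigma = np.cov(X_filled, rowvar=False, bias=True)
    for _ in range(max_iter):
        x_hats, Ss = zip(*(e_step_instance(x, mu, Sigma) for x in X))
        mu_new = np.mean(x_hats, axis=0)
        # ML analogue of Eq. (4.13): average the expected cross-products
        # and subtract the outer product of the updated mean.
        Sigma_new = np.mean(Ss, axis=0) - np.outer(mu_new, mu_new)
        # Stop when neither estimate changes by more than the threshold.
        converged = (np.abs(mu_new - mu).max() < tol
                     and np.abs(Sigma_new - Sigma).max() < tol)
        mu, Sigma = mu_new, Sigma_new
        if converged:
            break
    # Final imputation with the converged parameters.
    X_imputed = np.stack([e_step_instance(x, mu, Sigma)[0] for x in X])
    return X_imputed, mu, Sigma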
The EM algorithm is still in use today, but it usually acts as a component of a larger system in which it helps to fit distributions, as with the GTM neural networks in [95]. Research on EM algorithms continues, both to address their limitations and to apply them to new fields such as semi-supervised learning [97]. The best-known version of EM for real-valued data sets is the one introduced in [85], where the basic EM algorithm is extended with a regularization parameter.
4.4.2 Multiple Imputation
One big problem of maximum likelihood methods such as EM is that they tend to underestimate the inherent errors produced by the estimation process, formally the standard errors. The Multiple Imputation (MI) approach was designed to take this into account, making it a less biased imputation method at the cost of being computationally expensive. MI is a Monte Carlo approach, described very well in [80], in which we generate multiple imputed values from the observed data in a very similar way to the EM algorithm: it fills the incomplete data by repeatedly solving the observed-
 