$\dfrac{1}{1 + \gamma/m}$,
where $\gamma$ is the fraction of missing data in the data set. With 30% of MVs in each
data set, which is quite a high amount, m = 5 final data sets already achieve a 94%
efficiency. Increasing the number to m = 10 only slightly raises the efficiency to 97%,
a small gain for twice the computational effort.
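To make the arithmetic behind these two figures explicit, a direct evaluation of the formula above with $\gamma = 0.3$ gives

$\dfrac{1}{1 + 0.3/5} = \dfrac{1}{1.06} \approx 0.94$, whereas $\dfrac{1}{1 + 0.3/10} = \dfrac{1}{1.03} \approx 0.97$.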
To start, we need an estimate of the mean and covariance matrices. A good
approach is to take them from the solution provided by an EM algorithm once
its values have stabilized at the end of its execution [83]. The DA process
then starts by alternately filling in the MVs and making inferences about the unknown
parameters in a stochastic fashion. First, DA creates an imputation of the MVs using the
available values of the parameters, and then it draws new parameter values from the
Bayesian posterior distribution using the observed and the freshly imputed data. Iterating
this simulation of the MVs and of the parameters is what creates a Markov
chain that will converge at some point. The distribution of the parameters θ will
stabilize to the posterior distribution averaged over the MVs, and the distribution of
the MVs will stabilize to the predictive distribution: exactly the distribution needed to
draw values for the MIs.
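As an illustration of this alternation, below is a minimal sketch of one DA cycle for a univariate normal model, a deliberately simplified stand-in for the multivariate setting discussed here. The function name, the noninformative prior, and the use of NumPy are assumptions of this sketch, not the procedure of [83].

```python
import numpy as np

rng = np.random.default_rng(0)

def da_cycle(y, missing, mu, sigma2):
    """One DA cycle for a univariate normal model: I-step followed by P-step."""
    # I-step: impute the MVs by drawing from the current predictive distribution
    y = y.copy()
    y[missing] = rng.normal(mu, np.sqrt(sigma2), size=missing.sum())

    # P-step: draw new parameters from their posterior given the completed data,
    # here under a standard noninformative prior for (mu, sigma2)
    n = len(y)
    ybar, s2 = y.mean(), y.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)   # scaled inverse-chi-square draw
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))     # normal draw given sigma2
    return y, mu, sigma2
```

Here `y` is a one-dimensional array whose entries at the positions flagged by the boolean mask `missing` are placeholders that get overwritten at every I-step.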
Large rates of MVs in the data sets will cause the convergence to be slow. However,
the meaning of convergence here is different from that used in EM. In EM the parameter
estimates have converged when they no longer change from one iteration to the
next by more than a threshold. In DA the distribution of the parameters does not change
across iterations, but the random parameter values themselves keep changing, which
makes the convergence of DA more difficult to assess than that of EM. In [83] the authors
propose to reinterpret convergence in DA in terms of lack of serial dependence: DA
can be said to have converged by k cycles if, for t = 1, 2, ..., the value of any parameter
θ at iteration t is statistically independent of its value at iteration t + k. As the authors
show in [83], the DA algorithm usually converges under these terms in the same number
of cycles as EM or fewer.
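One simple, purely illustrative way to inspect this lack of serial dependence on a saved chain of parameter draws (not the formal assessment made in [83]) is to look at the lag-k sample autocorrelation:

```python
import numpy as np

def lag_k_autocorr(chain, k):
    """Sample autocorrelation at lag k of a saved parameter chain; values close
    to zero suggest that draws k cycles apart behave as independent."""
    chain = np.asarray(chain, dtype=float)
    return np.corrcoef(chain[:-k], chain[k:])[0, 1]
```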
The value k is interesting because it establishes how long we should run
the Markov chain in order to obtain MIs that are independent draws from the missing
data predictive distribution. A typical process is to perform m runs, each one of length
k. That is, for each imputation from 1 to m we run the DA process for k cycles.
It is a good idea not to be too conservative with the value of k: after convergence the
process simply remains stationary, whereas with too low a k the m imputed data sets will
not be truly independent. Remember that we do not need a high m value, so k acts
as the true measure of computational effort.
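Continuing the univariate sketch above (and reusing its hypothetical `da_cycle` helper), the "m runs of length k" scheme could look as follows; in practice the starting values would come from the stabilized EM solution rather than the crude observed-data estimates used here.

```python
def multiple_imputation(y, missing, m=5, k=50):
    """Return m imputed copies of y, each taken after k DA cycles."""
    imputed = []
    for _ in range(m):
        # crude starting values from the observed part of y; in practice these
        # would be the stabilized EM estimates mentioned above
        mu = y[~missing].mean()
        sigma2 = y[~missing].var(ddof=1)
        yc = y.copy()
        for _ in range(k):
            yc, mu, sigma2 = da_cycle(yc, missing, mu, sigma2)
        imputed.append(yc)
    return imputed
```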
Once the m MI data sets have been created, they can be analyzed by any standard
complete-data method. For example, we can apply a linear or logistic regression,
a classifier or any other technique to each one of the m data sets, and the
variability of the m results obtained will reflect the uncertainty due to the MVs. It is common
to combine the results following the rules provided by Rubin [80], which act as measures
of ordinary sample variation, to obtain a single inferential statement about the parameters
of interest.
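For a scalar parameter, a minimal sketch of these combining rules, pooling the m point estimates and adding the within- and between-imputation variability, could look as follows; the helper name is an assumption of this sketch, not Rubin's notation.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool m complete-data estimates and their variances into one estimate
    and its total variance (within- plus between-imputation variability)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                    # pooled point estimate
    u_bar = u.mean()                    # average within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b     # total variance of the pooled estimate
    return q_bar, t
```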