Dealing with Missing Values - Data Preprocessing in Data Mining - page 67

Graphics Reference

In-Depth Information

data. But a significative difference between the two methods is attained: while EM

generates a single imputation in each step from the estimated parameters at each

step, MI performs several imputations that yield several complete data sets.

This repeated imputation can be done thanks to the use of Markov Chain Monte

Carlo methods, as the several imputations are obtained by introducing a random

component, usually from a standard normal distribution. In a more advanced fashion,

MI also considers that the parameters estimates are in fact sample estimates. Thus,

the parameters are not directly estimated from the available data but, as the process

continues, they are drawn from their Bayesian posterior distributions given the data at

hand. These assumptions means that only in the case of MCAR or MARmissingness

mechanisms hold MI should be applied.

As a result Eq. ( 4.9 ) can be applied with slight changes as the e term is now a

sample from a standard normal distribution and is applied more than once to obtain

several imputed values for a single MV. As indicated in the previous paragraph,

MI has a Bayesian nature that forces the user to specify a prior distribution for the

parameters

of the model fromwhich the e term is drawn. In practice [ 83 ]isstressed

out that the results depend more on the election of the distribution for the data than

the distribution for

θ

. Unlike the single imputation performed by EMwhere only one

imputed value for eachMV is created (and thus only one value of e is drawn), MI will

create several versions of the data set, where the observed data X obs is essentially the

same, but the imputed values for X mis will be different in each data set created. This

process is usually known as data augmentation (DA) [ 91 ] as depicted in Fig. 4.2 .

Surprisingly not many imputation steps are needed. Rubin claims in [ 80 ] that only

3-5 steps are usually needed. He states that the efficiency of the final estimation built

upon m imputations is approximately:

θ

Fig. 4.2 Multiple imputation process by data augmentation. Every MV denoted by a '?' is replaced

by several imputed and different values that will be used to continue the process later

Next Page

Data Preprocessing in Data Mining

Search WWH ::

Custom Search

Home