$\dfrac{1}{1 + \gamma/m}$,
where $\gamma$ is the fraction of missing data in the data set. With 30% of MVs in each
data set, which is quite a high amount, m = 5 final data sets already achieve a 94%
efficiency. Increasing the number to m = 10 only slightly raises the efficiency to 97%,
a small gain for twice the computational effort.
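To make the arithmetic behind these two figures explicit, a direct evaluation of the formula above with $\gamma = 0.3$ gives

$\dfrac{1}{1 + 0.3/5} = \dfrac{1}{1.06} \approx 0.94$, whereas $\dfrac{1}{1 + 0.3/10} = \dfrac{1}{1.03} \approx 0.97$.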
To start, we need an estimate of the mean and covariance matrices. A good
approach is to take them from the solution provided by an EM algorithm once
its values have stabilized at the end of its execution [83]. The DA process
then starts by alternately filling in the MVs and making inferences about the unknown
parameters in a stochastic fashion. First, DA creates an imputation of the MVs using the
available values of the parameters, and then it draws new parameter values from the
Bayesian posterior distribution using the observed and the freshly imputed data. Iterating
this simulation of the MVs and of the parameters is what creates a Markov
chain that will converge at some point. The distribution of the parameters θ will
stabilize to the posterior distribution averaged over the MVs, and the distribution of
the MVs will stabilize to the predictive distribution: exactly the distribution needed to
draw values for the MIs.
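As an illustration of this alternation, below is a minimal sketch of one DA cycle for a univariate normal model, a deliberately simplified stand-in for the multivariate setting discussed here. The function name, the noninformative prior, and the use of NumPy are assumptions of this sketch, not the procedure of [83].

```python
import numpy as np

rng = np.random.default_rng(0)

def da_cycle(y, missing, mu, sigma2):
    """One DA cycle for a univariate normal model: I-step followed by P-step."""
    # I-step: impute the MVs by drawing from the current predictive distribution
    y = y.copy()
    y[missing] = rng.normal(mu, np.sqrt(sigma2), size=missing.sum())

    # P-step: draw new parameters from their posterior given the completed data,
    # here under a standard noninformative prior for (mu, sigma2)
    n = len(y)
    ybar, s2 = y.mean(), y.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)   # scaled inverse-chi-square draw
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))     # normal draw given sigma2
    return y, mu, sigma2
```

Here `y` is a one-dimensional array whose entries at the positions flagged by the boolean mask `missing` are placeholders that get overwritten at every I-step.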
Large rates of MVs in the data sets will cause the convergence to be slow. However,
the meaning of convergence here is different from that used in EM. In EM the parameter
estimates have converged when they no longer change from one iteration to the
next by more than a threshold. In DA the distribution of the parameters does not change
across iterations, but the random parameter values themselves keep changing, which
makes the convergence of DA more difficult to assess than that of EM. In [83] the authors
propose to reinterpret convergence in DA in terms of lack of serial dependence: DA
can be said to have converged by k cycles if, for t = 1, 2, ..., the value of any parameter
θ at iteration t is statistically independent of its value at iteration t + k. As the authors
show in [83], the DA algorithm usually converges under these terms in the same number
of cycles as EM or fewer.
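One simple, purely illustrative way to inspect this lack of serial dependence on a saved chain of parameter draws (not the formal assessment made in [83]) is to look at the lag-k sample autocorrelation:

```python
import numpy as np

def lag_k_autocorr(chain, k):
    """Sample autocorrelation at lag k of a saved parameter chain; values close
    to zero suggest that draws k cycles apart behave as independent."""
    chain = np.asarray(chain, dtype=float)
    return np.corrcoef(chain[:-k], chain[k:])[0, 1]
```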
The value k is interesting because it establishes how long we should run
the Markov chain in order to obtain MIs that are independent draws from the missing
data predictive distribution. A typical process is to perform m runs, each one of length
k. That is, for each imputation from 1 to m we run the DA process for k cycles.
It is a good idea not to be too conservative with the value of k: after convergence the
process simply remains stationary, whereas with too low a k the m imputed data sets will
not be truly independent. Remember that we do not need a high m value, so k acts
as the true measure of computational effort.
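Continuing the univariate sketch above (and reusing its hypothetical `da_cycle` helper), the "m runs of length k" scheme could look as follows; in practice the starting values would come from the stabilized EM solution rather than the crude observed-data estimates used here.

```python
def multiple_imputation(y, missing, m=5, k=50):
    """Return m imputed copies of y, each taken after k DA cycles."""
    imputed = []
    for _ in range(m):
        # crude starting values from the observed part of y; in practice these
        # would be the stabilized EM estimates mentioned above
        mu = y[~missing].mean()
        sigma2 = y[~missing].var(ddof=1)
        yc = y.copy()
        for _ in range(k):
            yc, mu, sigma2 = da_cycle(yc, missing, mu, sigma2)
        imputed.append(yc)
    return imputed
```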
Once the m MI data sets have been created, they can be analyzed by any standard
complete-data method. For example, we can apply a linear or logistic regression,
a classifier or any other technique to each one of the m data sets, and the
variability of the m results obtained will reflect the uncertainty due to the MVs. It is common
to combine the results following the rules provided by Rubin [80], which act as measures
of ordinary sample variation, to obtain a single inferential statement about the parameters
of interest.
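For a scalar parameter, a minimal sketch of these combining rules, pooling the m point estimates and adding the within- and between-imputation variability, could look as follows; the helper name is an assumption of this sketch, not Rubin's notation.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool m complete-data estimates and their variances into one estimate
    and its total variance (within- plus between-imputation variability)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                    # pooled point estimate
    u_bar = u.mean()                    # average within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b     # total variance of the pooled estimate
    return q_bar, t
```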