Dealing with Missing Values - Data Preprocessing in Data Mining


Rubin's rules for obtaining an overall set of estimated coefficients and standard errors proceed as follows. Let $R$ denote the estimate of interest and $U$ its estimated variance, $R$ being either an estimated regression coefficient or a kernel parameter of an SVM, whichever applies. Once the MIs have been obtained, we will have the estimates $R_1, R_2, \ldots, R_m$ and their respective variances $U_1, U_2, \ldots, U_m$. The overall estimate, occasionally called the MI estimate, is given by

$$\bar{R} = \frac{1}{m} \sum_{i=1}^{m} R_i. \tag{4.16}$$

The variance of this estimate has two components: the variability within each data set and the variability across data sets. The within-imputation variance is simply the average of the estimated variances:

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i, \tag{4.17}$$

whereas the between-imputation variance is the sample variance of the estimates $R_i$:

$$B = \frac{1}{m - 1} \sum_{i=1}^{m} \left(R_i - \bar{R}\right)^{2}. \tag{4.18}$$

The total variance $T$ is the sum of these two components, with a correction factor that accounts for the simulation error in $\bar{R}$:

$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B. \tag{4.19}$$
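The pooling steps in Eqs. (4.16) to (4.19) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name `pool_rubin` and the input numbers are hypothetical and do not come from the text.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates R_i and variances U_i.

    Hypothetical helper: returns (R_bar, U_bar, B, T) following
    Eqs. (4.16)-(4.19).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    r_bar = mean(estimates)        # Eq. (4.16): overall (MI) estimate
    u_bar = mean(variances)        # Eq. (4.17): within-imputation variance
    b = variance(estimates)        # Eq. (4.18): sample variance, divides by m - 1
    t = u_bar + (1 + 1 / m) * b    # Eq. (4.19): total variance
    return r_bar, u_bar, b, t


# Made-up values for m = 3 imputed data sets (illustration only).
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
r_bar, u_bar, b, t = pool_rubin(estimates, variances)
print(r_bar, u_bar, b, t)
```

Here `statistics.variance` is used because it already divides by m - 1, which matches the between-imputation variance in Eq. (4.18).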

The square root of $T$ is the overall standard error associated with $\bar{R}$. If no MVs were present in the original data set, all of $R_1, R_2, \ldots, R_m$ would be the same, so $B = 0$ and $T = \bar{U}$. The magnitude of $B$ with respect to $\bar{U}$ indicates how much information is contained in the missing portion of the data set relative to the observed part.
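To make the role of the two variance components concrete, the following sketch contrasts a case where every imputed data set yields the same estimate with a case where the estimates differ. The helper `variance_components` and all numbers are assumptions made for illustration, not values from the text.

```python
from statistics import mean, variance


def variance_components(estimates, variances):
    """Hypothetical helper returning (U_bar, B, T) as in Eqs. (4.17)-(4.19)."""
    m = len(estimates)
    u_bar = mean(variances)
    b = variance(estimates)
    return u_bar, b, u_bar + (1 + 1 / m) * b


# Case 1: identical estimates, as would happen with no missing values.
u_same, b_same, t_same = variance_components([0.8] * 4, [0.04] * 4)

# Case 2: estimates that vary across imputations (made-up numbers).
u_diff, b_diff, t_diff = variance_components([0.6, 0.8, 1.0, 0.8], [0.04] * 4)

# B relative to U_bar: larger values suggest that more information
# sits in the missing portion of the data.
ratio = b_diff / u_diff
print(b_same, t_same, ratio)
```

In the first case the between-imputation variance vanishes and the total variance reduces to the within-imputation variance; in the second, the total variance grows with the spread of the estimates.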

In [83] the authors elaborate on the confidence intervals extracted from $\bar{R}$ and on how to test the null hypothesis $R = 0$ by comparing the ratio $\bar{R}/\sqrt{T}$ with a Student's $t$-distribution with degrees of freedom

$$df = (m - 1)\left(1 + \frac{m\bar{U}}{(m + 1)B}\right)^{2}, \tag{4.20}$$

should readers wish to learn more about how this test can be used to check whether the number of imputations $m$ was large enough.
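As a rough sketch of the test described above, the code below computes the ratio of the pooled estimate to its standard error together with the degrees of freedom of Eq. (4.20). The function name `mi_t_test` and the inputs are hypothetical; comparing the ratio against a Student's t critical value (for example with a statistics library) is left out to keep the example dependency-free.

```python
import math
from statistics import mean, variance


def mi_t_test(estimates, variances):
    """Hypothetical helper returning (t_ratio, df) for the null hypothesis R = 0."""
    m = len(estimates)
    r_bar = mean(estimates)
    u_bar = mean(variances)
    b = variance(estimates)
    if b == 0:
        # With B = 0 the fraction in Eq. (4.20) is undefined (df grows without bound).
        raise ValueError("between-imputation variance is zero; df is undefined")
    t_total = u_bar + (1 + 1 / m) * b                      # Eq. (4.19)
    t_ratio = r_bar / math.sqrt(t_total)                   # pooled estimate / standard error
    df = (m - 1) * (1 + m * u_bar / ((m + 1) * b)) ** 2    # Eq. (4.20)
    return t_ratio, df


# Made-up values for m = 3 imputations (illustration only).
t_ratio, df = mi_t_test([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(t_ratio, df)
```

Note that the degrees of freedom are generally not an integer, and that they increase as $B$ shrinks relative to $\bar{U}$.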
