attributes (MC) [36] is widely used, especially when many instances in the data set contain MVs and applying DNI would result in a very reduced and unrepresentative pre-processed data set. This method is very simple: for nominal attributes, the MV is replaced with the most common attribute value, and numerical values are replaced with the average of all values of the corresponding attribute.
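The MC rule above can be sketched in a few lines of plain Python (a minimal illustration, representing MVs as `None`; the function name and data layout are our own, not from the original method description):

```python
import statistics

def most_common_imputation(rows, is_nominal):
    """MC imputation sketch: fill each missing value (None) with the
    column mode (nominal attributes) or the column mean (numerical)."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        fill = statistics.mode(observed) if is_nominal[j] else statistics.mean(observed)
        for r in filled:
            if r[j] is None:
                r[j] = fill  # replace the MV with the column-wide statistic
    return filled
```

Note that the same fill value is used for every instance with an MV in that column, regardless of its class, which is the limitation the CMC variant addresses.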
A variant of MC is the concept most common attribute value for nominal attributes, and concept average value for numerical attributes (CMC) [36]. As in MC, the MV is replaced by the most repeated value if nominal, or by the mean value if numerical, but CMC considers only the instances with the same class as the reference instance.
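A sketch of the CMC variant, restricting the mode/mean computation to instances of the same class (again a minimal illustration with hypothetical names, using `None` for MVs):

```python
import statistics

def cmc_impute(rows, labels, is_nominal):
    """CMC imputation sketch: like MC, but the mode/mean is computed only
    over instances sharing the class label of the instance being imputed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                # restrict to same-class instances with an observed value
                same_class = [r[j] for r, y in zip(rows, labels)
                              if y == labels[i] and r[j] is not None]
                filled[i][j] = (statistics.mode(same_class) if is_nominal[j]
                                else statistics.mean(same_class))
    return filled
```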
Older and rarely used DM approaches may also be placed under this category. For example, hot deck imputation goes back over 50 years and was used quite successfully by the Census Bureau and others. It is still referred to from time to time [84], so it is interesting to describe it here, partly for historical reasons and partly because it represents a distinct approach to replacing missing data.
Hot deck has its origins in the surveys carried out in the USA in the 1940s and 1950s, when most people felt compelled to participate in them. As a consequence, little data was missing, and when any records were effectively missing, a random complete case from the same survey was used to substitute the MVs. This process can be simulated nowadays by clustering the complete data and associating the incomplete instance with a cluster; any complete example from that cluster can then be used to fill in the MVs [6]. Cold deck is similar to hot deck, but the cases or instances used to fill in the MVs come from a different source; traditionally this meant that the donor case was obtained from a different survey. Some authors have recently assessed the limitations imposed on the donors (the instances used to substitute the MVs) [44].
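The donor idea behind hot deck can be sketched without a full clustering step by picking, for each incomplete instance, the closest complete instance as donor (a simplified nearest-neighbour stand-in for the cluster-based simulation described above; names and distance choice are our own):

```python
def hot_deck_impute(rows):
    """Nearest-neighbour hot deck sketch: each incomplete instance borrows
    its missing values (None) from the closest complete instance (the
    'donor'), measured over the attributes the recipient has observed."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        # squared Euclidean distance on the observed attributes only
        def dist(donor):
            return sum((a - b) ** 2 for a, b in zip(row, donor) if a is not None)
        donor = min(complete, key=dist)
        filled.append([d if v is None else v for v, d in zip(row, donor)])
    return filled
```

Choosing the donor from a cluster of similar complete cases, as in [6], plays the same role as the random draw from the same survey in the original procedure.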
4.4 Maximum Likelihood Imputation Methods
At the same time that Rubin et al. formalized the concept of missing data introduction mechanisms described in Sect. 4.2, they advised against using case deletion (IM) as a methodology to deal with MVs. However, using MC or CMC techniques is not much better than replacing MVs with fixed values, as they completely ignore the mechanisms that yield the data values. In an ideal (and rare) case where the parameters θ of the data distribution were known, a sample from such a distribution, conditioned on the other attributes' values or not depending on whether MCAR, MAR or NMAR applies, would be a suitable imputed value for the missing one. The problem is that the parameters θ are rarely known and are also very hard to estimate [38].
In a simple case such as flipping a coin, Pθ(heads) = θ and Pθ(tails) = 1 − θ.
Depending on whether the coin is rigged or not, the value of θ can vary, and thus its value is unknown. Our only choice is to flip the coin several times, say n, obtaining h heads and n − h tails. An estimate of θ would then be θ̂ = h/n.
More formally, the likelihood of θ given the observations is obtained from the binomial distribution, P(h | θ) = C(n, h) · θ^h · (1 − θ)^(n−h), where C(n, h) is the binomial coefficient. Our estimate θ̂ = h/n can be proven to be the maximum likelihood estimate of θ.
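The coin-flipping example can be checked numerically: evaluating the binomial likelihood over a grid of candidate θ values, the maximizer coincides with h/n (a small verification sketch, not part of the original text):

```python
from math import comb

def binomial_likelihood(theta, n, h):
    """Likelihood of heads-probability theta after observing h heads in n flips."""
    return comb(n, h) * theta ** h * (1 - theta) ** (n - h)

# Scan candidate values of theta; the likelihood is unimodal, so the
# grid maximizer approximates the maximum likelihood estimate h/n.
n, h = 10, 7
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=lambda t: binomial_likelihood(t, n, h))
```

Here the grid search recovers θ̂ = 0.7 = 7/10, matching the closed-form maximum likelihood estimate.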