Dealing with Missing Values - Data Preprocessing in Data Mining

Graphics Reference

In-Depth Information

where f is the probability function for a single case and

represents the parameters

of the model that yield such a particular instance of data. The main problem is that

the particular parameters' values

for the given data are very rarely known. For this

reason authors usually overcome this problem by considering distributions that are

commonly found in nature and their properties are well known as well. The three

distributions that standout among these are:

1. the multivariate normal distribution in the case of only real valued parameters;

2. the multinomial model for cross-classified categorical data (including loglinear

models) when the data consists of nominal features; and

3. mixed models for combined normal and categorical features in the data [ 50 , 55 ].

If we call X obs the observed part of X and we denote the missing part as X mis so

that X

, we can provide a first intuitive definition of what missing at

random (MAR) means. Informally talking, when the probability that an observation

is missing may depend on X obs but not on X mis we can state that the missing data is

missing at random.

In the case of MAR missing data mechanism, given a particular value or val-

ues for a set of features belonging to X obs , the distribution of the rest of features

is the same among the observed cases as it is among the missing cases. Follow-

ing Schafer's example based on [ 79 ], let suppose that we dispose an n

= (

X obs ,

X mis )

p matrix

called B of variables whose values are 1 or 0 when X elements are observed and

missing respectively. The distribution of B should be related to X and to some

unknown parameters

, so we dispose a probability model for B described by

(

,ζ)

. Having a MAR assumption means that this distribution does not depend

on X mis :

(

X obs ,

X mis ,ζ) =

(

X obs ,ζ).

(4.2)

Please be aware of MAR does not suggest that the missing data values consti-

tute just another possible sample from the probability distribution. This condition is

known as missing completely at random (MCAR). Actually MCAR is a special case

of MAR in which the distribution of an example having a MV for an attribute does

not depend on either the observed or the unobserved data. Following the previous

notation, we can say that

(

X obs ,

X mis ,ζ) =

(

| ζ).

(4.3)

Although there will generally be some loss of information, comparable results can be

obtained with missing data by carrying out the same analysis that would have been

performed with no MVs. In practice this means that, under MCAR, the analysis of

only those units with complete data gives valid inferences.

Please note that MCAR is more restrictive than MAR. MAR requires only that

the MVs behave like a random sample of all values in some particular subclasses

defined by observed data. In such a way, MAR allows the probability of a missing

datum to depend on the datum itself, but only indirectly through the observed values.

Data Preprocessing in Data Mining

Search WWH ::

Custom Search

Home