Dealing with Missing Values - Data Preprocessing in Data Mining

Recently, a software package has been published with which the MCAR condition can
be tested [43].

A third case arises when MAR does not apply because the MV depends on both the rest
of the observed values and the missing value itself. That is,

$$P(B \mid X_{obs}, X_{mis}, \zeta) \qquad (4.4)$$

is the actual probability estimation. This model is usually called not missing at
random (NMAR) or missing not at random (MNAR) in the literature. It is a challenge
for the user, as the only way to obtain an unbiased estimate is to model the
missingness as well. This is a very complex task: we must create a model accounting
for the missing data and later incorporate it into a more complex model used to
estimate the MVs. However, even when we cannot account for the missingness model,
the introduced bias may still be small enough. In [23] the reader can find an
example of how to perform this.
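To make the NMAR mechanism more concrete, the following is a minimal simulation sketch rather than a method taken from the cited references. It assumes a logistic dependence of the missingness probability on the value itself; the function name `mnar_mask` and the parameters `midpoint` and `scale` are hypothetical choices made only for this illustration.

```python
import math
import random
from statistics import mean


def mnar_mask(values, midpoint, scale, rng):
    """Return True where a value goes missing.

    The probability of being missing depends on the value itself
    (assumed logistic form), which is what makes the mechanism NMAR.
    """
    mask = []
    for v in values:
        # Lower values are more likely to be missing in this sketch.
        p_missing = 1.0 / (1.0 + math.exp((v - midpoint) / scale))
        mask.append(rng.random() < p_missing)
    return mask


rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
mask = mnar_mask(values, midpoint=50.0, scale=5.0, rng=rng)

# Only the values that were not masked are available to the analyst.
observed = [v for v, missing in zip(values, mask) if not missing]

print("mean of all values:     ", round(mean(values), 2))
print("mean of observed values:", round(mean(observed), 2))
```

Because low values are removed more often, the mean of the observed values overestimates the mean of the full sample. Nothing in the observed data alone reveals this, which is why the missingness itself has to be modeled.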

4.3 Simple Approaches to Missing Data

In this section we introduce the simplest methods used to deal with MVs. Because
they are so simple, they usually do not take the missingness mechanism into account
and perform the operation blindly.

The simplest approach is do not impute (DNI). As its name indicates, all the MVs
remain unreplaced, so the DM algorithm must use its default MV strategy, if it has
one. Often the objective is to verify whether imputation methods allow the
classification methods to perform better than when using the original data sets.
As a guideline, a previous study of imputation methods is presented in [37]. For
those learning methods that cannot deal with an explicit MV notation (a special
value, for instance), an alternative is to convert the MVs to a new value (encode
them into a new numerical value), but such a simplistic method has been shown to
lead to serious inference problems [82].
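The sketch below illustrates one way in which encoding MVs as a new numerical value can distort a simple statistic. It is an illustration only and is not the analysis given in [82]. The marker `MISSING`, the code `SENTINEL = -1`, the helper `encode_missing`, and the toy ages are all assumptions introduced for this example.

```python
from statistics import mean

MISSING = None   # marker used here for an MV (assumption of this sketch)
SENTINEL = -1    # hypothetical numerical code chosen to stand for "missing"

# Hypothetical toy attribute with two MVs.
ages = [23, 35, MISSING, 41, MISSING, 29]


def encode_missing(values, code=SENTINEL):
    """Replace every MV with a fixed numerical code."""
    return [code if v is MISSING else v for v in values]


encoded = encode_missing(ages)

# Mean over the values that were actually observed.
observed_mean = mean(v for v in ages if v is not MISSING)
# Mean computed by a method that treats the code as a real age.
encoded_mean = mean(encoded)

print("encoded attribute:", encoded)
print("mean of observed ages:", observed_mean)
print("mean after encoding:  ", encoded_mean)
```

Any method that treats the code as an ordinary number inherits this distortion, since the code enters distances, means, and split thresholds as if it were a real measurement.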

A very common approach in the specialized literature, even nowadays, is to apply
case deletion or ignore missing (IM). Using this method, all instances with at least
one MV are discarded from the data set. Although IM often results in a substantial
decrease in the sample size available for the analysis, it does have important
advantages. In particular, under the assumption that the data are MCAR, it leads to
unbiased parameter estimates. Unfortunately, even when the data are MCAR there is a
loss in power using this approach, especially if we have to rule out a large number
of subjects. And when the data are not MCAR, it biases the results. For example,
when low income individuals are less likely to report their income level, the
resulting mean is biased in favor of higher incomes. The alternative approaches
discussed below should be considered as a replacement for IM.
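Case deletion and the income example can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `ignore_missing`, the marker `MISSING`, and the toy (age, income) records are hypothetical, and the reporting threshold of 30 is chosen arbitrarily to mimic low earners withholding their income.

```python
from statistics import mean

MISSING = None  # marker used here for an MV (assumption of this sketch)


def ignore_missing(rows):
    """Case deletion (IM): keep only the instances that contain no MV."""
    return [row for row in rows if all(v is not MISSING for v in row)]


# Hypothetical toy data: (age, income in thousands).
complete = [(25, 18), (31, 22), (38, 25), (44, 60), (52, 75), (60, 90)]

# The same data after every individual earning below 30 declined to report income.
observed = [(age, MISSING if income < 30 else income) for age, income in complete]

kept = ignore_missing(observed)

true_mean = mean(income for _, income in complete)
im_mean = mean(income for _, income in kept)

print(len(observed), "instances before IM,", len(kept), "after")
print("mean income on the complete data:", round(true_mean, 2))
print("mean income after IM:            ", im_mean)
```

Half of the instances are lost, and because the deleted cases are exactly the low earners, the mean computed after IM lies above the mean of the complete data.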

Often seen as a good choice, the substitution of the MVs for the global most

common attribute value for nominal attributes, and global average value for numerical
