Dealing with Missing Values - Data Preprocessing in Data Mining


4.5.9 Recent Machine Learning Approaches to Missing Values Imputation

Although we have tried to provide an extensive introduction to the most used and
basic imputation methods based on ML techniques, a large number of journal
publications show their application and adaptation to real-world problems.
Table 4.1 gives the reader a summary of the most recent and most important
imputation methods presented as of the date of publication, both extensions of
the methods introduced here and completely novel ones.

4.6 Experimental Comparative Analysis

In this section we aim to provide the reader with a general overview of the behavior
and properties of all the imputation methods presented above. However, this is not
an easy task. The main question is: what is a good imputation method?

As multiple imputation is a very resource-consuming approach, we will focus on
the single imputation methods described in this chapter.

4.6.1 Effect of the Imputation Methods in the Attributes' Relationships

From an unsupervised point of view, the best imputation methods should be those
that generate values close to the true but unknown MVs. This idea has been
explored in the literature by taking complete data sets and artificially
introducing MVs into them. Note that such a procedure acts as an MCAR MV
generator, which validates the use of imputation methods. The imputation
methods are then applied to the data, and the distance between each imputed
value and the original (known) value is measured. Authors usually choose the
mean square error (MSE) or the root mean square error (RMSE) to quantify and
compare the imputation methods over a set of data sets [6, 32, 41, 77].
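The evaluation protocol just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure of any of the cited studies: the synthetic data, the 20% missing rate, the helper names (introduce_mcar, mean_impute, rmse_on_missing) and the choice of column-mean imputation are all hypothetical and picked here for brevity.

```python
import math
import random

def introduce_mcar(data, rate, rng):
    """Return a copy of data in which each cell is replaced by None
    with probability `rate`, independently of every value (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in data]

def mean_impute(data):
    """Fill each None with the mean of the observed values in its column."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        # Fall back to 0.0 if a column happens to be entirely missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]

def rmse_on_missing(original, masked, imputed):
    """RMSE between original and imputed values, over the masked cells only."""
    sq_errors = [
        (original[i][j] - imputed[i][j]) ** 2
        for i, row in enumerate(masked)
        for j, v in enumerate(row)
        if v is None
    ]
    if not sq_errors:
        return 0.0
    return math.sqrt(sum(sq_errors) / len(sq_errors))

rng = random.Random(0)
# Hypothetical complete data set: 200 examples, 3 numeric attributes.
complete = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 2.0), rng.uniform(0.0, 10.0)]
            for _ in range(200)]
masked = introduce_mcar(complete, rate=0.2, rng=rng)
imputed = mean_impute(masked)
print("RMSE on artificially removed values:",
      rmse_on_missing(complete, masked, imputed))
```

The error is computed only over the cells that were artificially removed, since the observed cells are left untouched by the imputer and would otherwise dilute the measure.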

On the other hand, other problems arise when we do not have the original values
or when the problem is supervised. In classification, for example, it is more
important to impute values that yield an easier and more generalizable problem.
Consequently, in this paradigm a good imputation method is one that enables the
classifier to obtain better accuracy. This is harder to measure, as we are
relating two different values: the MV itself and the class label assigned to the
example. Neither MSE nor RMSE can provide this kind of information.
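To make the supervised criterion concrete, imputation methods can be compared by the accuracy a classifier reaches on the imputed data rather than by the distance to the original values. The sketch below is a hedged illustration only: the synthetic two-class data, the 20% MCAR rate, the leave-one-out 1-NN classifier and the two simple imputers (impute_mean, impute_constant) are assumptions made here and are not the experimental setup of this chapter.

```python
import random

def impute_constant(data, value=0.0):
    """Fill every None with a fixed constant."""
    return [[value if v is None else v for v in row] for row in data]

def impute_mean(data):
    """Fill each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    using squared Euclidean distance."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(xi, xj))
            if d < best_d:
                best_d, best_j = d, j
        correct += y[best_j] == y[i]
    return correct / len(X)

rng = random.Random(1)
# Hypothetical two-class data: class 0 centred at 5, class 1 centred at 8.
X, y = [], []
for label, centre in ((0, 5.0), (1, 8.0)):
    for _ in range(60):
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
# Remove each value with probability 0.2, independently of the data (MCAR).
X_missing = [[None if rng.random() < 0.2 else v for v in row] for row in X]

for name, imputer in (("mean", impute_mean), ("constant 0", impute_constant)):
    acc = loo_1nn_accuracy(imputer(X_missing), y)
    print(f"{name:>10} imputation: leave-one-out 1-NN accuracy = {acc:.3f}")
```

Note that this criterion depends on the classifier chosen: a different learner may rank the same imputers differently, which is part of what makes the supervised case harder to assess.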

One way to measure how good the imputation is for the supervised task is to
use Wilson's Noise Ratio. This measure, proposed in [98], observes the noise in the
data set. For each instance of interest, the method looks for the KNN (using the
