4.5.9 Recent Machine Learning Approaches to Missing Values Imputation
Although we have tried to provide an extensive introduction to the most widely used
basic imputation methods based on ML techniques, a large number of journal
publications show their application and adaptation to real-world problems.
Table 4.1 summarizes the latest and most important imputation methods available at
the date of publication, covering both extensions of the methods introduced above
and completely novel ones.
4.6 Experimental Comparative Analysis
In this section we aim to provide the reader with a general overview of the behavior
and properties of all the imputation methods presented above. However, this is not
an easy task. The main question is: what is a good imputation method?
As multiple imputation is a very resource-consuming approach, we will focus on
the single imputation methods described in this chapter.
4.6.1 Effect of the Imputation Methods in the Attributes' Relationships
From an unsupervised point of view, the best imputation methods are those able to
generate values close to the true but unknown MVs. This idea has been explored in
the literature by taking complete data sets and artificially introducing MVs into
them. Note that such a procedure acts as an MCAR MV generator, which validates the
use of imputation methods. Imputation methods are then applied to the data, and the
distance between each imputed value and the original (and known) value is
estimated. Authors usually choose the mean square error (MSE) or root mean square
error (RMSE) to quantify and compare the imputation methods over a set of data
sets [6, 32, 41, 77].
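The evaluation protocol described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data are synthetic, the function names (inject_mcar, mean_impute, imputation_rmse) are hypothetical, and column-mean imputation stands in for whichever imputation method is being assessed. It is not the setup used in the cited studies.

```python
import numpy as np


def inject_mcar(X, missing_rate, rng):
    """Return a copy of X with entries removed completely at random (MCAR).

    Every cell has the same probability of being removed, independently of
    its own value and of the other attributes.
    """
    mask = rng.random(X.shape) < missing_rate  # True where a value is removed
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Replace each NaN with the mean of the observed values in its column.

    Stand-in for any single imputation method (an assumption of this sketch).
    """
    col_means = np.nanmean(X_missing, axis=0)
    X_imputed = X_missing.copy()
    rows, cols = np.where(np.isnan(X_missing))
    X_imputed[rows, cols] = col_means[cols]
    return X_imputed


def imputation_rmse(X_true, X_imputed, mask):
    """RMSE between original and imputed values, over the removed cells only."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic complete data set (assumption: 200 examples, 4 numeric attributes).
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))

# Artificially introduce about 10% MVs, impute them, and measure the error.
X_missing, mask = inject_mcar(X_true, missing_rate=0.1, rng=rng)
X_imputed = mean_impute(X_missing)
rmse = imputation_rmse(X_true, X_imputed, mask)
print(f"removed cells: {int(mask.sum())}, RMSE of mean imputation: {rmse:.3f}")
```

The error is computed only over the cells that were artificially removed, since the remaining values are left untouched by the imputation. Comparing several methods amounts to swapping mean_impute for each candidate while keeping the same mask.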
On the other hand, other problems arise when we do not have the original values
or the problem is supervised. In classification, for example, what matters more is
imputing values that yield an easier and more generalizable problem. As a
consequence, in this paradigm a good imputation method is one that enables the
classifier to obtain better accuracy. This is harder to measure, as we are relating
two different values: the MV itself and the class label assigned to the example.
Neither MSE nor RMSE can provide us with this kind of information.
One way to measure how good the imputation is for the supervised task is to
use Wilson's Noise Ratio. This measure, proposed by [98], observes the noise in the
data set. For each instance of interest, the method looks for the KNN (using the
 