4.5 Imputation of Missing Values. Machine Learning Based Methods
The imputation methods presented in Sect. 4.4 originated in statistics, and thus they model the relationships between values by estimating the underlying probability distributions. In Artificial Intelligence, modeling the unknown relationships between attributes and inferring the implicit information contained in a sample data set has been carried out with ML models. Many authors soon noticed that the same process used to predict a continuous or a nominal value after a previous learning step, as in regression or classification, can also be applied to predict the MVs. The use of ML methods for imputation spares us the search for an estimate of the underlying distribution of the data, but these methods are still subject to the MAR assumption in order to be applied correctly.
Batista [6] tested the classification accuracy of two popular classifiers (C4.5 and CN2) when the KNN-based proposal is used as an imputation method (KNNI), together with MC. Both the CN2 and C4.5 (like [37]) algorithms include their own MV estimation. From their study, KNNI yields good accuracy, but only when the attributes are not highly correlated with each other. In work related to this, the authors of [1] investigated the effect of four methods that deal with MVs. As in [6], they use KNNI and two other imputation methods (MC and median imputation), and they employ the KNN and Linear Discriminant Analysis classifiers. The results of their study show that the imputation procedure causes no significantly harmful effect on accuracy. In addition, they state that the KNNI method is more robust than the other compared methods as the amount of MVs in the data set increases.
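As a concrete illustration of the kind of comparison described above (not the experimental setup of [6] or [1]), the following Python sketch injects MVs into a toy data set, imputes them with the mean (the numeric counterpart of MC), the median and KNNI, and checks the cross-validated accuracy of a KNN classifier on each imputed version. The data set, missingness rate and parameter values are illustrative assumptions.

```python
# Minimal sketch: effect of simple imputation strategies on a KNN classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan   # inject roughly 20% MVs completely at random

imputers = {
    "mean (MC for numeric data)": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "KNNI (k=5)": KNNImputer(n_neighbors=5),
}
for name, imputer in imputers.items():
    # Imputation is fitted inside the pipeline, so each CV fold is imputed from its training part only.
    pipe = make_pipeline(imputer, KNeighborsClassifier(n_neighbors=3))
    acc = cross_val_score(pipe, X_miss, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {acc:.3f}")
```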
The idea of using ML or Soft Computing techniques as imputation methods spread from this point on. Li et al. [53] use a fuzzy clustering method, the Fuzzy K-Means imputation (FKMI), and compare it with mean substitution and K-Means imputation (KMI). Using a Root Mean Square Error (RMSE) analysis, they state that the basic KMI algorithm outperforms the MC method. The experiments also show that the overall performance of the FKMI method is better than that of the basic KMI method, particularly when the percentage of MVs is high. Feng et al. [29] use an SVM for filling in MVs (SVMI), but they do not compare it with any other imputation method. Furthermore, they state that enough complete examples without MVs should be selected as the training data set in this case.
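As a rough illustration of the clustering-based idea, the sketch below implements a simple K-Means imputation: missing cells are first filled with column means, the data is clustered, and each missing cell is then refined with the value of its row's cluster centroid. This is a simplified reading of KMI, not the exact procedure of [53]; the number of clusters, the number of refinement passes and the initial fill are assumptions, and the usage example reports the RMSE on the artificially deleted cells.

```python
# Simplified K-Means imputation (KMI) sketch for numeric data with np.nan as the MV marker.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_impute(X, n_clusters=3, n_iter=10, seed=0):
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_hat = X.copy()
    col_means = np.nanmean(X, axis=0)
    X_hat[miss] = np.take(col_means, np.where(miss)[1])    # crude initial fill with column means
    for _ in range(n_iter):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_hat)
        centers = km.cluster_centers_[km.labels_]           # centroid assigned to each row
        X_hat[miss] = centers[miss]                          # refine only the missing cells
    return X_hat

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_true = rng.normal(size=(100, 3))
    X_obs = X_true.copy()
    X_obs[rng.random(X_obs.shape) < 0.2] = np.nan            # delete ~20% of the cells
    X_imp = kmeans_impute(X_obs)
    mask = np.isnan(X_obs)
    print("RMSE on imputed cells:", np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))
```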
In the following we describe the main details of the most widely used imputation methods based on ML techniques. We have tried to stay as close as possible to the original notation used by the authors, so the interested reader can easily continue exploring the details in the corresponding papers.
4.5.1 Imputation with K-Nearest Neighbor (KNNI)
Using this instance-based algorithm, every time an MV is found in the current instance, KNNI computes its K nearest neighbors and imputes a value obtained from them. For nominal values, the most common value among the neighbors is used, and for numerical values the average of the neighbors' values is taken.
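To make the procedure concrete, here is a minimal, self-contained KNNI sketch under simplifying assumptions: missing entries are marked with None, distances are plain Euclidean distances computed over the numeric attributes observed in both instances, and k is fixed by the caller. The original formulation may rely on a different (e.g. heterogeneous) distance; the function name and the toy data are hypothetical.

```python
# Hand-rolled KNNI sketch: mode of the k nearest neighbors for nominal
# attributes, mean for numerical ones. None marks a missing value.
import numpy as np
from collections import Counter

def knni(X, nominal_cols, k=5):
    X = np.array(X, dtype=object)
    n, d = X.shape
    X_hat = X.copy()
    nominal_cols = set(nominal_cols)
    for i in range(n):
        miss_cols = [j for j in range(d) if X[i, j] is None]
        if not miss_cols:
            continue
        # Euclidean distance over numeric attributes observed in both instances.
        dists = []
        for t in range(n):
            if t == i:
                continue
            shared = [j for j in range(d) if j not in nominal_cols
                      and X[i, j] is not None and X[t, j] is not None]
            if not shared:
                continue
            diff = np.array([float(X[i, j]) - float(X[t, j]) for j in shared])
            dists.append((float(np.sqrt(np.sum(diff ** 2))), t))
        neighbors = [t for _, t in sorted(dists)[:k]]
        for j in miss_cols:
            vals = [X[t, j] for t in neighbors if X[t, j] is not None]
            if not vals:
                continue                                          # no usable neighbor: leave the MV
            if j in nominal_cols:
                X_hat[i, j] = Counter(vals).most_common(1)[0][0]  # mode for nominal attributes
            else:
                X_hat[i, j] = float(np.mean([float(v) for v in vals]))  # mean for numeric attributes
    return X_hat

# Tiny usage example: column 2 is nominal, None marks the MVs.
data = [[1.0, 2.0, "a"], [1.1, None, "a"], [5.0, 6.0, None], [5.2, 6.1, "b"], [4.9, 5.8, "b"]]
print(knni(data, nominal_cols=[2], k=2))
```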
 