4.5 Imputation of Missing Values. Machine Learning Based Methods
The imputation methods presented in Sect. 4.4 originated in statistics, and thus they model the relationships between values by estimating the underlying probability distributions. In Artificial Intelligence, modeling the unknown relationships between attributes and inferring the implicit information contained in a sample data set has been carried out with ML models. Many authors soon noticed that the same process used to predict a continuous or a nominal value after a previous learning step, as in regression or classification, can also be applied to predict the MVs. The use of ML methods for imputation spares us the search for an estimate of the underlying distribution of the data, but these methods are still subject to the MAR assumption in order to be applied correctly.
Batista [6] tested the classification accuracy of two popular classifiers (C4.5 and CN2) when the KNN-based proposal is used as an imputation method (KNNI), together with MC. Both the CN2 and C4.5 (like [37]) algorithms include their own MV estimation. From their study, KNNI yields good accuracy, but only when the attributes are not highly correlated with each other. In work related to this, the authors of [1] investigated the effect of four methods that deal with MVs. As in [6], they use KNNI and two other imputation methods (MC and median imputation), and they employ the KNN and Linear Discriminant Analysis classifiers. The results of their study show that the imputation procedure causes no significantly harmful effect on accuracy. In addition, they state that the KNNI method is more robust than the other compared methods as the amount of MVs in the data set increases.
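As a concrete illustration of the kind of comparison described above (not the experimental setup of [6] or [1]), the following Python sketch injects MVs into a toy data set, imputes them with the mean (the numeric counterpart of MC), the median and KNNI, and checks the cross-validated accuracy of a KNN classifier on each imputed version. The data set, missingness rate and parameter values are illustrative assumptions.

```python
# Minimal sketch: effect of simple imputation strategies on a KNN classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan   # inject roughly 20% MVs completely at random

imputers = {
    "mean (MC for numeric data)": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "KNNI (k=5)": KNNImputer(n_neighbors=5),
}
for name, imputer in imputers.items():
    # Imputation is fitted inside the pipeline, so each CV fold is imputed from its training part only.
    pipe = make_pipeline(imputer, KNeighborsClassifier(n_neighbors=3))
    acc = cross_val_score(pipe, X_miss, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {acc:.3f}")
```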
The idea of using ML or Soft Computing techniques as imputation methods spread from this point on. Li et al. [53] use a fuzzy clustering method, the Fuzzy K-Means imputation (FKMI), and compare it with mean substitution and K-Means imputation (KMI). Using a Root Mean Square Error (RMSE) analysis, they state that the basic KMI algorithm outperforms the MC method. The experiments also show that the overall performance of the FKMI method is better than that of the basic KMI method, particularly when the percentage of MVs is high. Feng et al. [29] use an SVM for filling in MVs (SVMI), but they do not compare it with any other imputation method. Furthermore, they state that enough complete examples without MVs should be selected as the training data set in this case.
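As a rough illustration of the clustering-based idea, the sketch below implements a simple K-Means imputation: missing cells are first filled with column means, the data is clustered, and each missing cell is then refined with the value of its row's cluster centroid. This is a simplified reading of KMI, not the exact procedure of [53]; the number of clusters, the number of refinement passes and the initial fill are assumptions, and the usage example reports the RMSE on the artificially deleted cells.

```python
# Simplified K-Means imputation (KMI) sketch for numeric data with np.nan as the MV marker.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_impute(X, n_clusters=3, n_iter=10, seed=0):
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_hat = X.copy()
    col_means = np.nanmean(X, axis=0)
    X_hat[miss] = np.take(col_means, np.where(miss)[1])    # crude initial fill with column means
    for _ in range(n_iter):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_hat)
        centers = km.cluster_centers_[km.labels_]           # centroid assigned to each row
        X_hat[miss] = centers[miss]                          # refine only the missing cells
    return X_hat

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_true = rng.normal(size=(100, 3))
    X_obs = X_true.copy()
    X_obs[rng.random(X_obs.shape) < 0.2] = np.nan            # delete ~20% of the cells
    X_imp = kmeans_impute(X_obs)
    mask = np.isnan(X_obs)
    print("RMSE on imputed cells:", np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))
```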
In the following we describe the main details of the most widely used imputation methods based on ML techniques. We have tried to stay as close as possible to the original notation used by the authors, so the interested reader can easily continue exploring the details in the corresponding papers.
4.5.1 Imputation with K-Nearest Neighbor (KNNI)
Using this instance-based algorithm, every time an MV is found in the current instance, KNNI computes its K nearest neighbors and imputes a value obtained from them. For nominal values, the most common value among the neighbors is used, and for numerical values the average of the neighbors' values is taken.
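To make the procedure concrete, here is a minimal, self-contained KNNI sketch under simplifying assumptions: missing entries are marked with None, distances are plain Euclidean distances computed over the numeric attributes observed in both instances, and k is fixed by the caller. The original formulation may rely on a different (e.g. heterogeneous) distance; the function name and the toy data are hypothetical.

```python
# Hand-rolled KNNI sketch: mode of the k nearest neighbors for nominal
# attributes, mean for numerical ones. None marks a missing value.
import numpy as np
from collections import Counter

def knni(X, nominal_cols, k=5):
    X = np.array(X, dtype=object)
    n, d = X.shape
    X_hat = X.copy()
    nominal_cols = set(nominal_cols)
    for i in range(n):
        miss_cols = [j for j in range(d) if X[i, j] is None]
        if not miss_cols:
            continue
        # Euclidean distance over numeric attributes observed in both instances.
        dists = []
        for t in range(n):
            if t == i:
                continue
            shared = [j for j in range(d) if j not in nominal_cols
                      and X[i, j] is not None and X[t, j] is not None]
            if not shared:
                continue
            diff = np.array([float(X[i, j]) - float(X[t, j]) for j in shared])
            dists.append((float(np.sqrt(np.sum(diff ** 2))), t))
        neighbors = [t for _, t in sorted(dists)[:k]]
        for j in miss_cols:
            vals = [X[t, j] for t in neighbors if X[t, j] is not None]
            if not vals:
                continue                                          # no usable neighbor: leave the MV
            if j in nominal_cols:
                X_hat[i, j] = Counter(vals).most_common(1)[0][0]  # mode for nominal attributes
            else:
                X_hat[i, j] = float(np.mean([float(v) for v in vals]))  # mean for numeric attributes
    return X_hat

# Tiny usage example: column 2 is nominal, None marks the MVs.
data = [[1.0, 2.0, "a"], [1.1, None, "a"], [5.0, 6.0, None], [5.2, 6.1, "b"], [4.9, 5.8, "b"]]
print(knni(data, nominal_cols=[2], k=2))
```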
 