from a research study, and can also limit the generalizability of the research findings
[ 96 ]. Three types of problems are usually associated with MVs in DM [ 5 ]:
1. loss of efficiency;
2. complications in handling and analyzing the data; and
3. bias resulting from differences between missing and complete data.
Recently, some authors have tried to estimate how many MVs are needed to noticeably
harm the prediction accuracy in classification [ 45 ].
Usually the treatment of MVs in DM can be handled in three different ways [ 27 ]:
The first approach is to discard the examples with MVs in their attributes. Deleting
attributes with elevated levels of MVs is included in this category too.
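As a minimal sketch of this discarding strategy, the following Python snippet first removes attributes with a high share of MVs and then removes the examples that still contain an MV. The toy data, the function names and the 0.5 threshold are assumptions made for illustration only; they do not come from the cited works.

```python
# Hypothetical toy data set: each dict is an example, None marks a missing value (MV).
rows = [
    {"age": 25, "income": 50000, "city": "A"},
    {"age": None, "income": 42000, "city": None},
    {"age": 31, "income": None, "city": None},
    {"age": 40, "income": 61000, "city": None},
]


def drop_sparse_attributes(rows, max_missing_ratio):
    """Remove attributes whose share of MVs exceeds max_missing_ratio."""
    n = len(rows)
    keep = [
        attr for attr in rows[0]
        if sum(r[attr] is None for r in rows) / n <= max_missing_ratio
    ]
    return [{attr: r[attr] for attr in keep} for r in rows]


def drop_incomplete_examples(rows):
    """Remove every example that still contains at least one MV."""
    return [r for r in rows if all(v is not None for v in r.values())]


# The 0.5 threshold is an arbitrary choice for this illustration.
reduced = drop_sparse_attributes(rows, max_missing_ratio=0.5)
complete = drop_incomplete_examples(reduced)
```

Here the "city" attribute is dropped because three of its four values are missing, and only the two examples without any remaining MV survive, which shows how quickly deletion can shrink a data set.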
Another approach is the use of maximum likelihood procedures, where the parameters
of a model for the complete portion of the data are estimated, and later used
for imputation by means of sampling.
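The idea can be illustrated with a deliberately simple sketch: a univariate normal model is fitted by maximum likelihood to the observed values of one attribute, and the MVs are then filled by sampling from the fitted distribution. The choice of a normal model, the function name `ml_sampling_impute` and the toy values are assumptions for illustration; the procedures in the literature typically work with richer multivariate models.

```python
import random
import statistics


def ml_sampling_impute(values, seed=0):
    """Fill None entries by sampling from a normal fitted to the observed entries."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)      # ML estimate of the mean
    sigma = statistics.pstdev(observed)  # ML estimate of the std. deviation (divides by n)
    rng = random.Random(seed)            # fixed seed so the sketch is reproducible
    return [v if v is not None else rng.gauss(mu, sigma) for v in values]


# Hypothetical attribute with two MVs.
heights = [1.70, None, 1.82, 1.65, None, 1.75]
completed = ml_sampling_impute(heights)
```

Observed values are left untouched; only the positions that held an MV receive a sampled value.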
Finally, the imputation of MVs is a class of procedures that aims to fill in the MVs
with estimated values. In most cases, a data set's attributes are not independent of
each other. Thus, through the identification of relationships among attributes, MVs
can be determined.
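One simple way to exploit a relationship between two attributes is regression imputation: a line is fitted on the complete examples and used to predict the missing attribute. The sketch below assumes a linear relationship and uses hypothetical names (`fit_line`, `regression_impute`) and toy data; it is one possible instance of the idea, not a method taken from the cited works.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_impute(pairs):
    """Fill a missing second attribute from the first, fitting on complete pairs only."""
    complete = [(x, y) for x, y in pairs if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [(x, y if y is not None else slope * x + intercept) for x, y in pairs]


# Hypothetical data: the second attribute is missing in the third example.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0)]
filled = regression_impute(data)
```

Because the complete pairs lie on the line y = 2x, the imputed value for the third example is close to 6, up to floating-point rounding.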
We will focus our attention on the use of imputation methods. A fundamental advantage
of this approach is that the MV treatment is independent of the learning algorithm
used. For this reason, the user can select the most appropriate method for each
situation faced. There is a broad family of imputation methods, ranging from simple
techniques such as mean substitution and KNN to those that analyze the relationships
between attributes, such as SVM-based methods, clustering-based methods, logistic
regression, maximum likelihood procedures and multiple imputation [ 6 , 26 ].
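To make the two simplest techniques concrete, the following sketch implements mean substitution and a basic KNN imputation for numeric data. The function names, the distance definition (root mean squared difference over the attributes observed in both examples) and the toy data are assumptions chosen for illustration; published KNN imputation variants differ in their distance and aggregation choices.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor examples."""

    def distance(a, b):
        # Root mean squared difference over attributes observed in both examples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is None:
                # Donors are the other examples with attribute j observed.
                donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
                donors.sort(key=lambda r: distance(row, r))
                nearest = donors[:k]
                if nearest:  # leave the MV in place if no donor exists
                    new_row[j] = sum(r[j] for r in nearest) / len(nearest)
        result.append(new_row)
    return result


# Hypothetical data: the last example misses its second attribute.
data = [
    [1.0, 10.0],
    [1.2, 12.0],
    [5.0, 50.0],
    [1.1, None],
]
by_mean = mean_impute(data)
by_knn = knn_impute(data, k=2)
```

On this toy data, mean substitution fills the MV with 24.0, the mean of the whole column, whereas KNN with k = 2 fills it with 11.0, the mean of the two most similar examples. Either completed data set can then be passed to any learning algorithm, which reflects the independence noted above.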
The use of imputation methods for MVs is a task with a well-established background.
The first formal studies can be traced back several decades. The work of [ 54 ] laid
the foundation for further work on this topic, especially in statistics. From their
work, imputation techniques based on sampling from estimated data distributions
followed, distinguishing between single imputation procedures (like
Expectation-Maximization (EM) procedures [ 81 ]) and multiple imputation ones [ 82 ],
the latter being more reliable and powerful but more difficult and restrictive to
apply.
These imputation procedures became very popular for quantitative data, and therefore
they were easily adopted in other fields of knowledge, like bioinformatics
[ 49 , 62 , 93 ], climatic science [ 85 ], medicine [ 94 ], etc. The imputation methods
proposed in each field are adapted to the common characteristics of the data analyzed
in it. With the popularization of the DM field, many studies on the treatment of MVs
arose, particularly for the classification task. Some of the existing imputation
procedures from other fields have been adapted for use in classification, for example
to deal with qualitative data, while many specific approaches have also been
proposed.
 