5 Missing Data
Failures of sensors, matching of databases with disjoint feature sets, or conditions under which data can get lost (e.g., in outer space due to X-rays) are typical examples of practical scenarios in which data sets are incomplete. Nevertheless, it might still be desirable to compute a latent embedding of such high-dimensional data. In this section we introduce strategies that allow unsupervised nearest neighbor regression to cope with missing data. The question arises whether the embedding approach can exploit useful structural information to reconstruct the missing entries. Experimental analyses will answer this question.
5.1 Imputation Methods
If the distribution of missingness is conditionally independent of the missing values given the observed data, the data is said to be missing at random¹. Schafer and Graham [24] have reviewed methods to handle such data. In the case of scarce data sets, joint densities can be computed in a probabilistic framework [9].
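Before turning to model-based approaches, a simple baseline helps fix ideas. The following is a minimal sketch (not from the text) of column-mean imputation, a common baseline under MAR-like missingness; it assumes NumPy, patterns stored as rows, and `np.nan` marking missing entries, with all names illustrative:

```python
import numpy as np

def mean_impute(Y):
    """Replace missing entries (np.nan) by the column-wise mean of the
    observed values -- a simple baseline that ignores correlations
    between features."""
    Y = np.array(Y, dtype=float)
    col_means = np.nanmean(Y, axis=0)        # means over observed entries only
    missing = np.isnan(Y)
    # write the matching column mean into each missing position
    Y[missing] = np.take(col_means, np.where(missing)[1])
    return Y
```

Mean imputation distorts the variance and covariance structure of the data, which motivates the regression-based repair strategies discussed next.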
Ideally, a method can deal with missing data directly (our embed-and-repair method, introduced in Section 5.3, belongs to this class). For SVM classification, such an approach has been introduced by Chechik et al. [6], who alter the SVM margin interpretation to deal directly with incomplete patterns. However, the method is better suited to features that are structurally absent than to those that are MNAR. An extension has been proposed by Dick et al. [7], who marginalize kernels over the assumed imputation distribution. The approach by Williams et al. [29] employs logistic regression for classification of incomplete data and performs an analytic integration with an estimated conditional density function instead of imputation. The approach is interesting, as it takes into account not only the complete patterns but also the incomplete patterns in a semi-supervised manner.
5.2 Repair-and-Embed
Let Y be the matrix of high-dimensional patterns. In the missing data scenario we assume that some patterns are incomplete, i.e., for at least one entry of y_j it holds y_ij = n.a. We can treat the problem of missing entries as a regression problem. First, we define Ŷ as the matrix of complete patterns, i.e., no entry satisfies y_ij = n.a. In contrast, Y \ Ŷ is the matrix of incomplete patterns. To complete Y \ Ŷ, repair-and-embed trains a regression model f based on Ŷ. We propose to first fill the vectors y_j from Y \ Ŷ with the minimal number of missing values, and to add the completed patterns to Ŷ for repairing the next vectors with the minimal number of missing entries in an iterative manner.

Let y_ij be the entry to complete. We can employ the matrix Ŷ_{-i} as training patterns², while ŷ_i = (ŷ_i1, ..., ŷ_iN) comprises the corresponding labels. Entry y_ij is estimated
¹ Missing at random (MAR) means that entries are missing randomly with uniform distribution, in contrast to missing not at random (MNAR), where dependencies exist, e.g., the missingness depends on certain distributions.
² Y_{-i} = ((y_1)_{-i}, ..., (y_N)_{-i}) with (y_k)_{-i} = (y_lk) for l = 1, ..., d and l ≠ i, i.e., pattern y_k without its i-th component.
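The iterative repair scheme of repair-and-embed can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: it uses k-nearest-neighbor regression as the model f, stores patterns as rows rather than columns for convenience, and the names `repair_fill` and the choice k=3 are illustrative:

```python
import numpy as np

def repair_fill(Y, k=3):
    """Sketch of the repair step: repeatedly pick the incomplete pattern
    with the fewest missing entries, estimate its missing entries by
    k-nearest-neighbor regression on the already complete patterns
    (distances computed on the observed dimensions only), and add the
    repaired pattern to the training set."""
    Y = np.array(Y, dtype=float)
    complete = ~np.isnan(Y).any(axis=1)       # which patterns are complete
    assert complete.any(), "at least one complete pattern is required"
    while not complete.all():
        cand = np.where(~complete)[0]
        # pattern with the minimal number of missing entries first
        j = cand[np.argmin(np.isnan(Y[cand]).sum(axis=1))]
        obs = ~np.isnan(Y[j])                 # observed dimensions of pattern j
        train = Y[complete]
        # Euclidean distance restricted to the observed dimensions
        d = np.linalg.norm(train[:, obs] - Y[j, obs], axis=1)
        nn = np.argsort(d)[:k]
        # KNN regression: mean of the k neighbors in the missing dimensions
        Y[j, ~obs] = train[nn][:, ~obs].mean(axis=0)
        complete[j] = True                    # repaired pattern joins the training set
    return Y
```

Repairing the least-damaged patterns first means each regression step is trained on the largest and most reliable set of complete patterns available at that point.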