5 Missing Data
Failures of sensors, matching of databases with disjoint feature sets, or conditions under which data can get lost (e.g., in outer space due to X-rays) are typical examples of practical scenarios in which data sets are incomplete. Nevertheless, it might still be desirable to compute a latent embedding of such high-dimensional data. In this section we introduce strategies that allow unsupervised nearest neighbor regression to cope with missing data. The question arises whether the embedding approach can exploit useful structural information to reconstruct the missing entries. Experimental analyses will answer this question.
5.1 Imputation Methods
If the distribution of missingness is conditionally independent of the missing values given the observed data, the data is said to be missing at random¹. Schafer and Graham [24] have reviewed methods to handle such data. In the case of scarce data sets, joint densities can be computed in a probabilistic framework [9].
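Before turning to model-based approaches, a simple baseline helps fix ideas. The following is a minimal sketch (not from the text) of column-mean imputation, a common baseline under MAR-like missingness; it assumes NumPy, patterns stored as rows, and `np.nan` marking missing entries, with all names illustrative:

```python
import numpy as np

def mean_impute(Y):
    """Replace missing entries (np.nan) by the column-wise mean of the
    observed values -- a simple baseline that ignores correlations
    between features."""
    Y = np.array(Y, dtype=float)
    col_means = np.nanmean(Y, axis=0)        # means over observed entries only
    missing = np.isnan(Y)
    # write the matching column mean into each missing position
    Y[missing] = np.take(col_means, np.where(missing)[1])
    return Y
```

Mean imputation distorts the variance and covariance structure of the data, which motivates the regression-based repair strategies discussed next.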
Ideally, a method can deal with missing data directly (our embed-and-repair method, introduced in Section 5.3, belongs to this class). For SVM classification, such an approach has been introduced by Chechik et al. [6], who alter the SVM margin interpretation to deal directly with incomplete patterns. However, the method is better suited to features that are structurally absent than to those that are MNAR. An extension has been proposed by Dick et al. [7], who marginalize kernels over the assumed imputation distribution. The approach by Williams et al. [29] employs logistic regression for classification of incomplete data and performs an analytic integration with an estimated conditional density function instead of imputation. The approach is interesting, as it takes into account not only the complete patterns but also the incomplete patterns in a semi-supervised manner.
5.2 Repair-and-Embed
Let Y be the matrix of high-dimensional patterns. In the missing data scenario we assume that some patterns are incomplete, i.e., for at least one entry of y_j it holds y_ij = n.a. We can treat the problem of missing entries as a regression problem. First, we define Ŷ as the matrix of complete patterns, i.e., no entry satisfies y_ij = n.a. In contrast, Y \ Ŷ is the matrix of incomplete patterns. To complete Y \ Ŷ, repair-and-embed trains a regression model f based on Ŷ. We propose to first fill the vectors y_j from Y \ Ŷ with the minimal number of missing values, and to add the completed patterns to Ŷ for repairing the next vectors with the minimal number of missing entries in an iterative manner.

Let y_ij be the entry to complete. We can employ the matrix Ŷ_{-i} as training patterns², while ŷ_i = (ŷ_i1, ..., ŷ_iN) comprises the corresponding labels. Entry y_ij is estimated
¹ Missing at random (MAR) means that entries are missing randomly with uniform distribution, in contrast to missing not at random (MNAR), where dependencies exist, e.g., the missingness depends on certain distributions.
² Y_{-i} = ((y_1)_{-i}, ..., (y_N)_{-i}) with (y_k)_{-i} = (y_lk) for l = 1, ..., d and l ≠ i, i.e., pattern y_k without its i-th component.
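The iterative repair scheme of repair-and-embed can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: it uses k-nearest-neighbor regression as the model f, stores patterns as rows rather than columns for convenience, and the names `repair_fill` and the choice k=3 are illustrative:

```python
import numpy as np

def repair_fill(Y, k=3):
    """Sketch of the repair step: repeatedly pick the incomplete pattern
    with the fewest missing entries, estimate its missing entries by
    k-nearest-neighbor regression on the already complete patterns
    (distances computed on the observed dimensions only), and add the
    repaired pattern to the training set."""
    Y = np.array(Y, dtype=float)
    complete = ~np.isnan(Y).any(axis=1)       # which patterns are complete
    assert complete.any(), "at least one complete pattern is required"
    while not complete.all():
        cand = np.where(~complete)[0]
        # pattern with the minimal number of missing entries first
        j = cand[np.argmin(np.isnan(Y[cand]).sum(axis=1))]
        obs = ~np.isnan(Y[j])                 # observed dimensions of pattern j
        train = Y[complete]
        # Euclidean distance restricted to the observed dimensions
        d = np.linalg.norm(train[:, obs] - Y[j, obs], axis=1)
        nn = np.argsort(d)[:k]
        # KNN regression: mean of the k neighbors in the missing dimensions
        Y[j, ~obs] = train[nn][:, ~obs].mean(axis=0)
        complete[j] = True                    # repaired pattern joins the training set
    return Y
```

Repairing the least-damaged patterns first means each regression step is trained on the largest and most reliable set of complete patterns available at that point.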