Sorting High-Dimensional Patterns with Unsupervised
Nearest Neighbors
Oliver Kramer
Department of Computer Science, University of Oldenburg,
Uhlhornsweg 84, 26111 Oldenburg, Germany
oliver.kramer@uni-oldenburg.de
Abstract. In many scientific disciplines structures in high-dimensional data have
to be detected, e.g., in stellar spectra, genome data, or in face recognition tasks. In
this work we present an approach to non-linear dimensionality reduction based on
fitting nearest neighbor regression into the unsupervised regression framework for
learning low-dimensional manifolds. The problem of optimizing latent neighbor-
hoods is difficult to solve, but the unsupervised nearest neighbor (UNN) formula-
tion allows an efficient strategy of iteratively embedding latent points to discrete
neighborhood topologies. The choice of an appropriate loss function is relevant,
in particular for noisy and high-dimensional data spaces. We extend UNN by the
ε-insensitive loss, which makes it possible to ignore small residuals below a defined
threshold. Furthermore, we introduce techniques to handle incomplete data. Experi-
mental analyses on various artificial and real-world test problems demonstrate
the performance of the approaches.
Keywords: Dimensionality reduction, Unsupervised regression, Nearest neigh-
bors, Robust loss functions, Missing data.
1 Introduction
Dimensionality reduction and manifold learning have an important part to play in the
understanding of data. Many disciplines in science and economy are based on collecting
high-dimensional patterns: from astronomy to psychology, from civil engineering to so-
cial web services. Algorithms are required that are able to process data efficiently. The
collection and understanding of data allows us to improve the efficiency of processes
in a variety of domains. There are numerous examples that reflect the importance of
the understanding of large data sets. The quality of sensors is steadily being improved.
The trend towards digitizing the world leads to large amounts of high-dimensional pat-
terns. For an efficient data analysis process, fast dimensionality reduction methods are
required. Unsupervised nearest neighbors (UNN) is a fast iterative approach based on
unsupervised regression. The idea
of unsupervised regression is to reverse functional regression models such that low-
dimensional data samples in latent space optimally reconstruct high-dimensional out-
put data. We take this framework as basis for an iterative approach that fits K-nearest
neighbors (KNN) regression into this unsupervised setting.
The manifold problem we consider is a point-wise mapping F : y → x from patterns
y ∈ ℝ^d to latent embeddings x ∈ ℝ^q with d > q. The problem is a hard optimization
problem, as the latent variables X = (x_1, ..., x_N) are unknown.
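The reconstruction idea behind unsupervised regression can be illustrated with a minimal sketch (the function and variable names below are illustrative, not the paper's implementation): given candidate latent positions X, KNN regression reconstructs each pattern y_i as the mean of the patterns belonging to the K nearest latent neighbors of x_i, and the quality of the embedding is measured by the resulting squared reconstruction error in data space.

```python
import numpy as np

def knn_reconstruct(X, Y, k=2):
    # For each latent point x_i, find its k nearest latent neighbors
    # (excluding x_i itself) and reconstruct y_i as the mean of the
    # patterns of those neighbors.
    N = X.shape[0]
    Y_hat = np.empty_like(Y)
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        nbrs = np.argsort(d)[:k]
        Y_hat[i] = Y[nbrs].mean(axis=0)
    return Y_hat

def reconstruction_error(X, Y, k=2):
    # Squared data space reconstruction error of the latent embedding X:
    # the quantity a UNN-style method seeks to minimize over X.
    return np.sum((Y - knn_reconstruct(X, Y, k)) ** 2)
```

A latent embedding that preserves the neighborhood structure of the patterns yields a low reconstruction error, while a scrambled embedding yields a high one; iteratively placing latent points so as to reduce this error is the core of the approach.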