12.4.5 Dealing with High-Dimensional Euclidean Data
We saw in Section 12.4.2 that the two-dimensional case of Euclidean data is fairly easy.
A number of data structures have been developed for finding near neighbors when the number of dimensions grows and the training set is large. We shall
not cover these structures here, because the subject could fill a book by itself, and there are
many places available to learn about these techniques, collectively called multidimensional
index structures. The references for this chapter give some sources for information
about such structures as kd-trees, R-trees, and quad trees.
Unfortunately, for high-dimensional data, there is little that can be done to avoid searching a large portion of the data. This fact is another manifestation of the “curse of dimensionality” from Section 7.1.3. Two ways to deal with the “curse” are the following:
(1) VA Files. Since we must look at a large fraction of the data anyway in order to find
the nearest neighbors of a query point, we could avoid a complex data structure altogether. Accept that we must scan the entire file, but do so in a two-stage manner. First,
a summary of the file is created, using only a small number of bits that approximate
the values of each component of each training vector. For example, if we use only the
high-order quarter of the bits in numerical components, then we can create a file that
is one-quarter the size of the full dataset. However, by scanning this file we can construct a
list of candidates that might be among the k nearest neighbors of the query q, and this
list may be a small fraction of the entire dataset. We then look up only these candidates
in the complete file, in order to determine which k are nearest to q. A sketch of this two-stage scan appears after this list.
(2) Dimensionality Reduction. We may treat the vectors of the training set as a matrix,
where the rows are the vectors of the training examples, and the columns correspond
to the components of these vectors. Apply one of the dimensionality-reduction techniques of Chapter 11 to compress the vectors to a small number of dimensions, small
enough that the techniques for multidimensional indexing can be used. Of course,
when processing a query vector q, the same transformation must be applied to q before
searching for q's nearest neighbors. A sketch of this approach also appears below.
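The following is a minimal sketch of the VA-file idea in option (1), written in Python with NumPy. The names train (the matrix of training vectors), q (the query vector), bits, and the helper functions are illustrative assumptions, not part of the text; the uniform per-component grid and the filtering rule (keep every vector whose lower-bound distance is at most the k-th smallest upper bound) are one standard way to guarantee that the candidate list contains the true k nearest neighbors.

import numpy as np

def build_va_file(train, bits=2):
    # Stage 0: the compact "summary" file -- a few bits per component,
    # here realized as a uniform grid with 2**bits cells per dimension.
    lo, hi = train.min(axis=0), train.max(axis=0)
    cells = 2 ** bits
    width = np.where(hi > lo, (hi - lo) / cells, 1.0)   # guard constant columns
    approx = np.clip(((train - lo) / width).astype(int), 0, cells - 1)
    return approx, lo, width

def va_knn(train, q, k, bits=2):
    approx, lo, width = build_va_file(train, bits)
    # Stage 1: scan only the approximation and bound each true distance.
    cell_lo = lo + approx * width                        # lower edge of each cell
    cell_hi = cell_lo + width                            # upper edge of each cell
    gap = np.maximum(np.maximum(cell_lo - q, 0.0), np.maximum(q - cell_hi, 0.0))
    lower = np.sqrt((gap ** 2).sum(axis=1))              # lower bound on distance
    far = np.maximum(np.abs(q - cell_lo), np.abs(q - cell_hi))
    upper = np.sqrt((far ** 2).sum(axis=1))              # upper bound on distance
    # A vector whose lower bound exceeds the k-th smallest upper bound
    # cannot be among the k nearest neighbors, so it is never read in full.
    threshold = np.partition(upper, k - 1)[k - 1]
    candidates = np.where(lower <= threshold)[0]
    # Stage 2: exact distances, but only for the (usually small) candidate list.
    exact = np.linalg.norm(train[candidates] - q, axis=1)
    return candidates[np.argsort(exact)[:k]]

Only the small approximation file is scanned in full; the complete vectors are read only for the candidates, which is where the savings claimed in option (1) come from.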
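For option (2), a minimal sketch follows, again in Python. The use of a truncated SVD as the dimensionality-reduction step and of scipy.spatial.KDTree as the multidimensional index are assumptions chosen for illustration; any of the techniques of Chapter 11 and any of the index structures mentioned in this section could be substituted.

import numpy as np
from scipy.spatial import KDTree

def build_reduced_index(train, dims=3):
    # Compress the training vectors to `dims` dimensions with a truncated SVD.
    _, _, vt = np.linalg.svd(train, full_matrices=False)
    proj = vt[:dims].T                    # d x dims projection matrix
    tree = KDTree(train @ proj)           # index the low-dimensional vectors
    return tree, proj

def reduced_knn(tree, proj, q, k):
    # The query must be projected with the SAME transformation before searching.
    _, idx = tree.query(q @ proj, k=k)
    return idx

Note that distances in the reduced space only approximate the original Euclidean distances, so the neighbors returned this way are approximate unless a verification pass over the original vectors is added.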
12.4.6 Dealing with Non-Euclidean Distances
To this point, we have assumed that the distance measure is Euclidean. However, most of
the techniques can be adapted naturally to an arbitrary distance function d. For instance,
in Section 12.4.4 we talked about using a normal distribution as a kernel function. Since
we were thinking about a one-dimensional training set in a Euclidean space, we wrote the