Database Reference
In-Depth Information
exponent as −( x q ) 2 . However, for any distance function d , we can use as the weight of a
point x at distance d ( x , q ) from the query point q the value of
e −( d ( x q )) 2 / σ 2
Note that this expression makes sense if the data is in some high-dimensional Euclidean
space and d is the usual Euclidean distance or Manhattan distance or any other distance
discussed in Section 3.5.2 . It also makes sense if d is Jaccard distance or any other distance
measure.
However, for Jaccard distance and the other distance measures we considered in Section
3.5 we also have the option to use locality-sensitive hashing, the subject of Chapter 3 .
Recall these methods are only approximate, and they could yield false negatives - training
examples that were near neighbors to a query but that do not show up in a search.
If we are willing to accept such errors occasionally, we can build the buckets for the
training set and keep them as the representation of the training set. These buckets are de-
signed so we can retrieve all (or almost all, since there can be false negatives) training-set
points that are have a minimum similarity to a given query q . Equivalently, one of the buck-
ets to which the query hashes will contain all those points within some maximum distance
of q . We hope that as many nearest neighbors of q as our method requires will be found
among those buckets.
Yet if different queries have radically different distances to their nearest neighbors, all is
not lost. We can pick several distances d 1 < d 2 < d 3 < · · ·. Build the buckets for locally-
sensitive hashing using each of these distances. For a query q , start with the buckets for
distance d 1 . If we find enough near neighbors, we are done. Otherwise, repeat the search
using the buckets for d 2 , and so on, until enough nearest neighbors are found.
12.4.7
Exercises for Section 12.4
EXERCISE 12.4.1 Suppose we modified Example 12.11 to look at the two nearest neighbors
of a query point q . Classify q with the common label if those two neighbors have the same
label, and leave q unclassified if the labels of the neighbors are different.
(a) Sketch the boundaries of the regions for the three dog breeds on Fig. 12.21 .
! (b) Would the boundaries always consist of straight line segments for any training
data?
EXERCISE 12.4.2 Suppose we have the following training set
([1, 2], +1) ([2, 1], −1)
Search WWH ::




Custom Search