Database Reference
In-Depth Information
exponent as −(
x
−
q
)
2
. However, for any distance function
d
, we can use as the weight of a
point
x
at distance
d
(
x
,
q
) from the query point
q
the value of
e
−(
d
(
x
−
q
))
2
/
σ
2
Note that this expression makes sense if the data is in some high-dimensional Euclidean
space and
d
is the usual Euclidean distance or Manhattan distance or any other distance
discussed in
Section 3.5.2
.
It also makes sense if
d
is Jaccard distance or any other distance
measure.
However, for Jaccard distance and the other distance measures we considered in
Section
Recall these methods are only approximate, and they could yield false negatives - training
examples that were near neighbors to a query but that do not show up in a search.
If we are willing to accept such errors occasionally, we can build the buckets for the
training set and keep them as the representation of the training set. These buckets are de-
signed so we can retrieve all (or almost all, since there can be false negatives) training-set
points that are have a minimum similarity to a given query
q
. Equivalently, one of the buck-
ets to which the query hashes will contain all those points within some maximum distance
of
q
. We hope that as many nearest neighbors of
q
as our method requires will be found
among those buckets.
Yet if different queries have radically different distances to their nearest neighbors, all is
not lost. We can pick several distances
d
1
<
d
2
<
d
3
< · · ·. Build the buckets for locally-
sensitive hashing using each of these distances. For a query
q
, start with the buckets for
distance
d
1
. If we find enough near neighbors, we are done. Otherwise, repeat the search
using the buckets for
d
2
, and so on, until enough nearest neighbors are found.
12.4.7
Exercises for Section 12.4
EXERCISE
12.4.1 Suppose we modified
Example 12.11
to look at the two nearest neighbors
of a query point
q
. Classify
q
with the common label if those two neighbors have the same
label, and leave
q
unclassified if the labels of the neighbors are different.
(a) Sketch the boundaries of the regions for the three dog breeds on
Fig. 12.21
.
!
(b) Would the boundaries always consist of straight line segments for any training
data?
EXERCISE
12.4.2 Suppose we have the following training set
([1, 2], +1) ([2, 1], −1)