Large-Scale Machine Learning - Mining of Massive Datasets

Database Reference

In-Depth Information

exponent as −( x − q ) 2 . However, for any distance function d , we can use as the weight of a

point x at distance d ( x , q ) from the query point q the value of

e −( d ( x − q )) 2 / σ 2

Note that this expression makes sense if the data is in some high-dimensional Euclidean

space and d is the usual Euclidean distance or Manhattan distance or any other distance

discussed in Section 3.5.2 . It also makes sense if d is Jaccard distance or any other distance

measure.

However, for Jaccard distance and the other distance measures we considered in Section

3.5 we also have the option to use locality-sensitive hashing, the subject of Chapter 3 .

Recall these methods are only approximate, and they could yield false negatives - training

examples that were near neighbors to a query but that do not show up in a search.

If we are willing to accept such errors occasionally, we can build the buckets for the

training set and keep them as the representation of the training set. These buckets are de-

signed so we can retrieve all (or almost all, since there can be false negatives) training-set

points that are have a minimum similarity to a given query q . Equivalently, one of the buck-

ets to which the query hashes will contain all those points within some maximum distance

of q . We hope that as many nearest neighbors of q as our method requires will be found

among those buckets.

Yet if different queries have radically different distances to their nearest neighbors, all is

not lost. We can pick several distances d 1 < d 2 < d 3 < · · ·. Build the buckets for locally-

sensitive hashing using each of these distances. For a query q , start with the buckets for

distance d 1 . If we find enough near neighbors, we are done. Otherwise, repeat the search

using the buckets for d 2 , and so on, until enough nearest neighbors are found.

12.4.7

Exercises for Section 12.4

EXERCISE 12.4.1 Suppose we modified Example 12.11 to look at the two nearest neighbors

of a query point q . Classify q with the common label if those two neighbors have the same

label, and leave q unclassified if the labels of the neighbors are different.

(a) Sketch the boundaries of the regions for the three dog breeds on Fig. 12.21 .

! (b) Would the boundaries always consist of straight line segments for any training

data?

EXERCISE 12.4.2 Suppose we have the following training set

([1, 2], +1) ([2, 1], −1)

Search WWH ::

Custom Search

Home