Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

3.7.5

More LSH Families for Euclidean Spaces

There is something unsatisfying about the family of hash functions developed in Section

3.7.4 . First, the technique was only described for two-dimensional Euclidean spaces. What

happens if our data is points in a space with many dimensions? Second, for Jaccard and co-

sine distances, we were able to develop locality-sensitive families for any pair of distances

d 1 and d 2 as long as d 1 < d 2 . In Section 3.7.4 we appear to need the stronger condition d 1 <

4 d 2 .

However, we claim that there is a locality-sensitive family of hash functions for any d 1 <

d 2 and for any number of dimensions. The family's hash functions still derive from random

lines through the space and a bucket size a that partitions the line. We still hash points by

projecting them onto the line. Given that d 1 < d 2 , we may not know what the probability

p 1 is that two points at distance d 1 hash to the same bucket, but we can be certain that it is

greater than p 2 , the probability that two points at distance d 2 hash to the same bucket. The

reason is that this probability surely grows as the distance shrinks. Thus, even if we cannot

calculate p 1 and p 2 easily, we know that there is a ( d 1 , d 2 , p 1 , p 2 )-sensitive family of hash

functions for any d 1 < d 2 and any given number of dimensions.

Using the amplification techniques of Section 3.6.3 , we can then adjust the two probab-

ilities to surround any particular value we like, and to be as far apart as we like. Of course,

the further apart we want the probabilities to be, the larger the number of basic hash func-

tions in F we must use.

3.7.6

Exercises for Section 3.7

EXERCISE 3.7.1 Suppose we construct the basic family of six locality-sensitive functions

for vectors of length six. For each pair of the vectors 000000, 110011, 010101, and 011100,

which of the six functions makes them candidates?

EXERCISE 3.7.2 Let us compute sketches using the following four “random” vectors:

v 1 = [+1 , +1 , +1 , −1]

v 2 = [+1 , +1 , −1 , +1]

v 3 = [+1 , −1 , +1 , +1]

v 4 = [−1 , +1 , +1 , +1]

Compute the sketches of the following vectors.

(a) [2 , 3 , 4 , 5].

(b) [−2 , 3 , −4 , 5].

(c) [2 , −3 , 4 , −5].

Search WWH ::

Custom Search

Home