Database Reference
In-Depth Information
! EXERCISE 3.5.9 Prove that the edit distance discussed in Exercise 3.5.8 is indeed a dis-
tance measure.
EXERCISE 3.5.10 Find the Hamming distances between each pair of the following vectors:
000000, 110011, 010101, and 011100.
3.6 The Theory of Locality-Sensitive Functions
The LSH technique developed in Section 3.4 is one example of a family of functions (the
minhash functions) that can be combined (by the banding technique) to distinguish strongly
between pairs at a low distance from pairs at a high distance. The steepness of the S-curve
in Fig. 3.7 reflects how effectively we can avoid false positives and false negatives among
the candidate pairs.
Now, we shall explore other families of functions, besides the minhash functions, that
can serve to produce candidate pairs efficiently. These functions can apply to the space of
sets and the Jaccard distance, or to another space and/or another distance measure. There
are three conditions that we need for a family of functions:
(1) They must be more likely to make close pairs be candidate pairs than distant pairs. We
make this notion precise in Section 3.6.1 .
(2) They must be statistically independent, in the sense that it is possible to estimate the
probability that two or more functions will all give a certain response by the product
rule for independent events.
(3) They must be efficient, in two ways:
(a) They must be able to identify candidate pairs in time much less than the time it
takes to look at all pairs. For example, minhash functions have this capability,
since we can hash sets to minhash values in time proportional to the size of the
data, rather than the square of the number of sets in the data. Since sets with com-
mon values are colocated in a bucket, we have implicitly produced the candidate
pairs for a single minhash function in time much less than the number of pairs of
sets.
(b) They must be combinable to build functions that are better at avoiding false pos-
itives and negatives, and the combined functions must also take time that is much
less than the number of pairs. For example, the banding technique of Section 3.4.1
takes single minhash functions, which satisfy condition (3)(a) but do not, by them-
selves have the S-curve behavior we want, and produces from a number of min-
hash functions a combined function that has the S-curve shape.
Search WWH ::




Custom Search