Databases Reference
In-Depth Information
3. They must be e cient, in two ways:
(a) They must be able to identify candidate pairs in time much less
than the time it takes to look at all pairs. For example, minhash
functions have this capability, since we can hash sets to minhash
values in time proportional to the size of the data, rather than the
square of the number of sets in the data. Since sets with common
values are colocated in a bucket, we have implicitly produced the
candidate pairs for a single minhash function in time much less than
the number of pairs of sets.
(b) They must be combinable to build functions that are better at avoid-
ing false positives and negatives, and the combined functions must
also take time that is much less than the number of pairs. For ex-
ample, the banding technique of Section 3.4.1 takes single minhash
functions, which satisfy condition 3a but do not, by themselves have
the S-curve behavior we want, and produces from a number of min-
hash functions a combined function that has the S-curve shape.
Our first step is to define “locality-sensitive functions” generally. We then
see how the idea can be applied in several applications. Finally, we discuss
how to apply the theory to arbitrary data with either a cosine distance or a
Euclidean distance measure.
3.6.1 Locality-Sensitive Functions
For the purposes of this section, we shall consider functions that take two items
and render a decision about whether these items should be a candidate pair.
In many cases, the function f will “hash” items, and the decision will be based
on whether or not the result is equal. Because it is convenient to use the
notation f (x) = f (y) to mean that f (x, y) is “yes; make x and y a candidate
pair,” we shall use f (x) = f (y) as a shorthand with this meaning. We also use
f (x) = f (y) to mean “do not make x and y a candidate pair unless some other
function concludes we should do so.”
A collection of functions of this form will be called a family of functions.
For example, the family of minhash functions, each based on one of the possible
permutations of rows of a characteristic matrix, form a family.
Let d 1 < d 2 be two distances according to some distance measure d. A
family F of functions is said to be (d 1 , d 2 , p 1 , p 2 )-sensitive if for every f in F:
1. If d(x, y)≤d 1 , then the probability that f (x) = f (y) is at least p 1 .
2. If d(x, y)≥d 2 , then the probability that f (x) = f (y) is at most p 2 .
Figure 3.9 illustrates what we expect about the probability that a given
function in a (d 1 , d 2 , p 1 , p 2 )-sensitive family will declare two items to be a can-
didate pair. Notice that we say nothing about what happens when the distance
Search WWH ::




Custom Search