Database Reference
In-Depth Information
3.6.2
Locality-Sensitive Families for Jaccard Distance
For the moment, we have only one way to find a family of locality-sensitive functions: use
the family of minhash functions, and assume that the distance measure is the Jaccard dis-
tance. As before, we interpret a minhash function h to make x and y a candidate pair if and
only if h ( x ) = h ( y ).
• The family of minhash functions is a ( d 1 , d 2 , 1 − d 1 , 1 − d 2 )-sensitive family for
any d 1 and d 2 , where 0 ≤ d 1 < d 2 ≤ 1.
The reason is that if d ( x, y ) ≤ d 1 , where d is the Jaccard distance, then SIM( x, y ) = 1 − d ( x,
y ) ≥ 1 − d 1 . But we know that the Jaccard similarity of x and y is equal to the probability
that a minhash function will hash x and y to the same value. A similar argument applies to
d 2 or any distance.
EXAMPLE 3.17 We could let d 1 = 0 . 3 and d 2 = 0 . 6. Then we can assert that the family of
minhash functions is a (0 . 3 , 0 . 6 , 0 . 7 , 0 . 4)-sensitive family. That is, if the Jaccard distance
between x and y is at most 0.3 (i.e., SIM( x, y ) ≥ 0 . 7) then there is at least a 0.7 chance that
a minhash function will send x and y to the same value, and if the Jaccard distance between
x and y is at least 0.6 (i.e., SIM( x, y ) ≤ 0 . 4), then there is at most a 0.4 chance that x and
y will be sent to the same value. Note that we could make the same assertion with another
choice of d 1 and d 2 ; only d 1 < d 2 is required.
3.6.3
Amplifying a Locality-Sensitive Family
Suppose we are given a ( d 1 , d 2 , p 1 , p 2 )-sensitive family F . We can construct a new family
F ′ by the AND-construction on F , which is defined as follows. Each member of F ′ consists
of r members of F for some fixed r . If f is in F ′, and f is constructed from the set { f 1 , f 2 , .
. . , f r } of members of F , we say f ( x ) = f ( y ) if and only if f i ( x ) = f i ( y ) for all i = 1 , 2 , . . . ,
r . Notice that this construction mirrors the effect of the r rows in a single band: the band
makes x and y a candidate pair if every one of the r rows in the band say that x and y are
equal (and therefore a candidate pair according to that row).
Since the members of F are independently chosen to make a member of F ′, we can assert
that F ′ is a ( d 1 , d 2 , ( p 1 ) r , ( p 2 ) r )-sensitive family. That is, for any p , if p is the probability that
a member of F will declare ( x, y ) to be a candidate pair, then the probability that a member
of F ′ will so declare is p r .
There is another construction, which we call the OR-construction , that turns a ( d 1 , d 2 , p 1 ,
p 2 )-sensitive family F into a ( d 1 , d 2 , 1 − (1 − p 1 ) b , 1 − (1 − p 2 ) b )-sensitive family F ′. Each
Search WWH ::




Custom Search