Database Reference
In-Depth Information
3.6.2
Locality-Sensitive Families for Jaccard Distance
For the moment, we have only one way to find a family of locality-sensitive functions: use
the family of minhash functions, and assume that the distance measure is the Jaccard dis-
tance. As before, we interpret a minhash function
h
to make
x
and
y
a candidate pair if and
only if
h
(
x
) =
h
(
y
).
• The family of minhash functions is a (
d
1
, d
2
,
1 −
d
1
,
1 −
d
2
)-sensitive family for
any
d
1
and
d
2
, where 0 ≤
d
1
<
d
2
≤ 1.
The reason is that if
d
(
x, y
) ≤
d
1
, where
d
is the Jaccard distance, then SIM(
x, y
) = 1 −
d
(
x,
y
) ≥ 1 −
d
1
. But we know that the Jaccard similarity of
x
and
y
is equal to the probability
that a minhash function will hash
x
and
y
to the same value. A similar argument applies to
d
2
or any distance.
EXAMPLE
3.17 We could let
d
1
= 0
.
3 and
d
2
= 0
.
6. Then we can assert that the family of
minhash functions is a (0
.
3
,
0
.
6
,
0
.
7
,
0
.
4)-sensitive family. That is, if the Jaccard distance
between
x
and
y
is at most 0.3 (i.e., SIM(
x, y
) ≥ 0
.
7) then there is at least a 0.7 chance that
a minhash function will send
x
and
y
to the same value, and if the Jaccard distance between
x
and
y
is at least 0.6 (i.e., SIM(
x, y
) ≤ 0
.
4), then there is at most a 0.4 chance that
x
and
y
will be sent to the same value. Note that we could make the same assertion with another
choice of
d
1
and
d
2
; only
d
1
<
d
2
is required.
□
3.6.3
Amplifying a Locality-Sensitive Family
Suppose we are given a (
d
1
, d
2
, p
1
, p
2
)-sensitive family
F
. We can construct a new family
F
′ by the
AND-construction
on
F
, which is defined as follows. Each member of
F
′ consists
of
r
members of
F
for some fixed
r
. If
f
is in
F
′, and
f
is constructed from the set {
f
1
, f
2
, .
. . , f
r
} of members of
F
, we say
f
(
x
) =
f
(
y
) if and only if
f
i
(
x
) =
f
i
(
y
) for all
i
= 1
,
2
, . . . ,
r
. Notice that this construction mirrors the effect of the
r
rows in a single band: the band
makes
x
and
y
a candidate pair if every one of the
r
rows in the band say that
x
and
y
are
equal (and therefore a candidate pair according to that row).
Since the members of
F
are independently chosen to make a member of
F
′, we can assert
that
F
′ is a (
d
1
, d
2
,
(
p
1
)
r
,
(
p
2
)
r
)-sensitive family. That is, for any
p
, if
p
is the probability that
a member of
F
will declare (
x, y
) to be a candidate pair, then the probability that a member
of
F
′ will so declare is
p
r
.
There is another construction, which we call the
OR-construction
, that turns a (
d
1
, d
2
, p
1
,
p
2
)-sensitive family
F
into a (
d
1
, d
2
,
1 − (1 −
p
1
)
b
,
1 − (1 −
p
2
)
b
)-sensitive family
F
′. Each