Database Reference
In-Depth Information
signature matrix:
has 1 only in the column for
S
3
, we only compare the current signature column for that set,
[2
,
0] with the hash values [0
,
3]. Since 0 < 2, we change SIG(1
,
3) to 0, but since 3 > 0 we
do not change SIG(2
,
3). The final signature matrix is:
We can estimate the Jaccard similarities of the underlying sets from this signature matrix.
Notice that columns 1 and 4 are identical, so we guess that SIM(
S
1
, S
4
) = 1
.
0. If we look
fraction of rows that agree in the signature matrix is only an estimate of the true Jaccard
similarity, and this example is much too small for the law of large numbers to assure that
the estimates are close. For additional examples, the signature columns for
S
1
and
S
3
agree
in half the rows (true similarity 1/4), while the signatures of
S
1
and
S
2
estimate 0 as their
Jaccard similarity (the correct value).
□
3.3.6
Exercises for Section 3.3
EXERCISE
3.3.1 Verify the theorem from
Section 3.3.3
,
which relates the Jaccard similarity
!
(b) Compute, for each pair of columns of that figure, the fraction of the 120 permuta-
tions of the rows that make the two columns hash to the same value.
ues of the following hash functions:
(a)
h
3
(
x
) = 2
x
+ 4.
(b)
h
4
(
x
) = 3
x
− 1.
EXERCISE
3.3.3 In
Fig. 3.5
is a matrix with six rows.