Database Reference
In-Depth Information
the column for S 2 in Fig. 3.4 has 0 in the row we are currently considering. The resulting
signature matrix:
Finally, consider the row of Fig. 3.4 numbered 4. h 1 (4) = 0 and h 2 (4) = 3. Since row 4
has 1 only in the column for S 3 , we only compare the current signature column for that set,
[2 , 0] with the hash values [0 , 3]. Since 0 < 2, we change SIG(1 , 3) to 0, but since 3 > 0 we
do not change SIG(2 , 3). The final signature matrix is:
We can estimate the Jaccard similarities of the underlying sets from this signature matrix.
Notice that columns 1 and 4 are identical, so we guess that SIM( S 1 , S 4 ) = 1 . 0. If we look
at Fig. 3.4 , we see that the true Jaccard similarity of S 1 and S 4 is 2/3. Remember that the
fraction of rows that agree in the signature matrix is only an estimate of the true Jaccard
similarity, and this example is much too small for the law of large numbers to assure that
the estimates are close. For additional examples, the signature columns for S 1 and S 3 agree
in half the rows (true similarity 1/4), while the signatures of S 1 and S 2 estimate 0 as their
Jaccard similarity (the correct value).
3.3.6
Exercises for Section 3.3
EXERCISE 3.3.1 Verify the theorem from Section 3.3.3 , which relates the Jaccard similarity
to the probability of minhashing to equal values, for the particular case of Fig. 3.2 .
(a) Compute the Jaccard similarity of each of the pairs of columns in Fig. 3.2 .
! (b) Compute, for each pair of columns of that figure, the fraction of the 120 permuta-
tions of the rows that make the two columns hash to the same value.
EXERCISE 3.3.2 Using the data from Fig. 3.4 , add to the signatures of the columns the val-
ues of the following hash functions:
(a) h 3 ( x ) = 2 x + 4.
(b) h 4 ( x ) = 3 x − 1.
EXERCISE 3.3.3 In Fig. 3.5 is a matrix with six rows.
Search WWH ::




Custom Search