Databases Reference
In-Depth Information
(a) Compute the Jaccard similarity of each of the pairs of columns in Fig. 3.2.
! (b) Compute, for each pair of columns of that figure, the fraction of the 120
permutations of the rows that make the two columns hash to the same
value.
Exercise 3.3.2 : Using the data from Fig. 3.4, add to the signatures of the
columns the values of the following hash functions:
(a) h 3 (x) = 2x + 4.
•h 4 (x) = 3x−1.
Element
S 1
S 2
S 3
S 4
0
0
1
0
1
1
0
1
0
0
2
1
0
0
1
3
0
0
1
0
4
0
0
1
1
5
1
0
0
0
Figure 3.5: Matrix for Exercise 3.3.3
Exercise 3.3.3 : In Fig. 3.5 is a matrix with six rows.
(a) Compute the minhash signature for each column if we use the following
three hash functions:
h 1 (x) = 2x + 1
mod 6; h 2 (x) = 3x + 2
mod 6;
h 3 (x) = 5x + 2
mod 6.
(b) Which of these hash functions are true permutations?
(c) How close are the estimated Jaccard similarities for the six pairs of columns
to the true Jaccard similarities?
! Exercise 3.3.4 : Now that we know Jaccard similarity is related to the proba-
bility that two sets minhash to the same value, reconsider Exercise 3.1.3. Can
you use this relationship to simplify the problem of computing the expected
Jaccard similarity of randomly chosen sets?
! Exercise 3.3.5 : Prove that if the Jaccard similarity of two columns is 0, then
then minhashing always gives a correct estimate of the Jaccard similarity.
!! Exercise 3.3.6 : One might expect that we could estimate the Jaccard simi-
larity of columns without using all possible permutations of rows. For example,
we could only allow cyclic permutations; i.e., start at a randomly chosen row
r, which becomes the first in the order, followed by rows r + 1, r + 2, and so
Search WWH ::




Custom Search