Database Reference
In-Depth Information
Figure 3.5 Matrix for Exercise 3.3.3
(a) Compute the minhash signature for each column if we use the following three hash
functions: h 1 ( x ) = 2 x +1 mod 6; h 2 ( x ) = 3 x +2 mod 6; h 3 ( x ) = 5 x +2 mod 6.
(b) Which of these hash functions are true permutations?
(c) How close are the estimated Jaccard similarities for the six pairs of columns to the true
Jaccard similarities?
! EXERCISE 3.3.4 Now that we know Jaccard similarity is related to the probability that two
sets minhash to the same value, reconsider Exercise 3.1.3 . Can you use this relationship
to simplify the problem of computing the expected Jaccard similarity of randomly chosen
sets?
! EXERCISE 3.3.5 Prove that if the Jaccard similarity of two columns is 0, then then min-
hashing always gives a correct estimate of the Jaccard similarity.
!! EXERCISE 3.3.6 One might expect that we could estimate the Jaccard similarity of
columns without using all possible permutations of rows. For example, we could only al-
low cyclic permutations; i.e., start at a randomly chosen row r , which becomes the first in
the order, followed by rows r + 1, r + 2, and so on, down to the last row, and then con-
tinuing with the first row, second row, and so on, down to row r − 1. There are only n such
permutations if there are n rows. However, these permutations are not sufficient to estimate
the Jaccard similarity correctly. Give an example of a two-column matrix where averaging
over all the cyclic permutations does not give the Jaccard similarity.
! EXERCISE 3.3.7 Suppose we want to use a MapReduce framework to compute minhash
signatures. If the matrix is stored in chunks that correspond to some columns, then it is
quite easy to exploit parallelism. Each Map task gets some of the columns and all the hash
functions, and computes the minhash signatures of its given columns. However, suppose
the matrix were chunked by rows, so that a Map task is given the hash functions and a set
of rows to work on. Design Map and Reduce functions to exploit MapReduce with data in
this form.
Search WWH ::




Custom Search