Databases Reference
In-Depth Information
3.3.4
Minhash Signatures
Again think of a collection of sets represented by their characteristic matrix M .
To represent sets, we pick at random some number n of permutations of the
rows of M . Perhaps 100 permutations or several hundred permutations will do.
Call the minhash functions determined by these permutations h 1 , h 2 , . . . , h n .
From the column representing set S, construct the minhash signature for S, the
vector [h 1 (S), h 2 (S), . . . , h n (S)]. We normally represent this list of hash-values
as a column. Thus, we can form from matrix M a signature matrix, in which
the ith column of M is replaced by the minhash signature for (the set of) the
ith column.
Note that the signature matrix has the same number of columns as M but
only n rows. Even if M is not represented explicitly, but in some compressed
form suitable for a sparse matrix (e.g., by the locations of its 1's), it is normal
for the signature matrix to be much smaller than M .
3.3.5
Computing Minhash Signatures
It is not feasible to permute a large characteristic matrix explicitly. Even picking
a random permutation of millions or billions of rows is time-consuming, and
the necessary sorting of the rows would take even more time. Thus, permuted
matrices like that suggested by Fig. 3.3, while conceptually appealing, are not
implementable.
Fortunately, it is possible to simulate the effect of a random permutation by
a random hash function that maps row numbers to as many buckets as there
are rows. A hash function that maps integers 0, 1, . . . , k−1 to bucket numbers
0 through k−1 typically will map some pairs of integers to the same bucket and
leave other buckets unfilled. However, the difference is unimportant as long as
k is large and there are not too many collisions. We can maintain the fiction
that our hash function h “permutes” row r to position h(r) in the permuted
order.
Thus, instead of picking n random permutations of rows, we pick n randomly
chosen hash functions h 1 , h 2 , . . . , h n on the rows. We construct the signature
matrix by considering each row in their given order. Let SIG (i, c) be the element
of the signature matrix for the ith hash function and column c. Initially, set
SIG (i, c) to∞for all i and c. We handle row r by doing the following:
1. Compute h 1 (r), h 2 (r), . . . , h n (r).
2. For each column c do the following:
(a) If c has 0 in row r, do nothing.
(b) However, if c has 1 in row r, then for each i = 1, 2, . . . , n set SIG (i, c)
to the smaller of the current value of SIG (i, c) and h i (r).
Search WWH ::




Custom Search