Database Reference
In-Depth Information
in which the i th column of M is replaced by the minhash signature for (the set of) the i th
column.
Note that the signature matrix has the same number of columns as M but only n rows.
Even if M is not represented explicitly, but in some compressed form suitable for a sparse
matrix (e.g., by the locations of its 1's), it is normal for the signature matrix to be much
smaller than M .
3.3.5
Computing Minhash Signatures
It is not feasible to permute a large characteristic matrix explicitly. Even picking a random
permutation of millions or billions of rows is time-consuming, and the necessary sorting of
the rows would take even more time. Thus, permuted matrices like that suggested by Fig.
3.3 , while conceptually appealing, are not implementable.
Fortunately, it is possible to simulate the effect of a random permutation by a random
hash function that maps row numbers to as many buckets as there are rows. A hash function
that maps integers 0 , 1 , . . . , k − 1 to bucket numbers 0 through k − 1 typically will map
some pairs of integers to the same bucket and leave other buckets unfilled. However, the
difference is unimportant as long as k is large and there are not too many collisions. We
can maintain the fiction that our hash function h “permutes” row r to position h ( r ) in the
permuted order.
Thus, instead of picking n random permutations of rows, we pick n randomly chosen
hash functions h 1 , h 2 , . . . , h n on the rows. We construct the signature matrix by considering
each row in their given order. Let SIG( i, c ) be the element of the signature matrix for the
i th hash function and column c . Initially, set SIG( i, c ) to ∞ for all i and c . We handle row r
by doing the following:
(1) Compute h 1 ( r ) , h 2 ( r ) , . . . , h n ( r ).
(2) For each column c do the following:
(a) If c has 0 in row r , do nothing.
(b) However, if c has 1 in row r , then for each i = 1 , 2 , . . . , n set SIG( i, c ) to the
smaller of the current value of SIG( i, c ) and h i ( r ).
EXAMPLE 3.8 Let us reconsider the characteristic matrix of Fig. 3.2 , which we reproduce
with some additional data as Fig. 3.4 . We have replaced the letters naming the rows by in-
tegers 0 through 4. We have also chosen two hash functions: h 1 ( x ) = x + 1 mod 5 and h 2 ( x )
= 3 x + 1 mod 5. The values of these two functions applied to the row numbers are given in
the last two columns of Fig. 3.4 . Notice that these simple hash functions are true permuta-
Search WWH ::




Custom Search