Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

in which the i th column of M is replaced by the minhash signature for (the set of) the i th

column.

Note that the signature matrix has the same number of columns as M but only n rows.

Even if M is not represented explicitly, but in some compressed form suitable for a sparse

matrix (e.g., by the locations of its 1's), it is normal for the signature matrix to be much

smaller than M .

3.3.5

Computing Minhash Signatures

It is not feasible to permute a large characteristic matrix explicitly. Even picking a random

permutation of millions or billions of rows is time-consuming, and the necessary sorting of

the rows would take even more time. Thus, permuted matrices like that suggested by Fig.

3.3 , while conceptually appealing, are not implementable.

Fortunately, it is possible to simulate the effect of a random permutation by a random

hash function that maps row numbers to as many buckets as there are rows. A hash function

that maps integers 0 , 1 , . . . , k − 1 to bucket numbers 0 through k − 1 typically will map

some pairs of integers to the same bucket and leave other buckets unfilled. However, the

difference is unimportant as long as k is large and there are not too many collisions. We

can maintain the fiction that our hash function h “permutes” row r to position h ( r ) in the

permuted order.

Thus, instead of picking n random permutations of rows, we pick n randomly chosen

hash functions h 1 , h 2 , . . . , h n on the rows. We construct the signature matrix by considering

each row in their given order. Let SIG( i, c ) be the element of the signature matrix for the

i th hash function and column c . Initially, set SIG( i, c ) to ∞ for all i and c . We handle row r

by doing the following:

(1) Compute h 1 ( r ) , h 2 ( r ) , . . . , h n ( r ).

(2) For each column c do the following:

(a) If c has 0 in row r , do nothing.

(b) However, if c has 1 in row r , then for each i = 1 , 2 , . . . , n set SIG( i, c ) to the

smaller of the current value of SIG( i, c ) and h i ( r ).

EXAMPLE 3.8 Let us reconsider the characteristic matrix of Fig. 3.2 , which we reproduce

with some additional data as Fig. 3.4 . We have replaced the letters naming the rows by in-

tegers 0 through 4. We have also chosen two hash functions: h 1 ( x ) = x + 1 mod 5 and h 2 ( x )

= 3 x + 1 mod 5. The values of these two functions applied to the row numbers are given in

the last two columns of Fig. 3.4 . Notice that these simple hash functions are true permuta-

Search WWH ::

Custom Search

Home