Database Reference
In-Depth Information
in which the
i
th column of
M
is replaced by the minhash signature for (the set of) the
i
th
column.
Note that the signature matrix has the same number of columns as
M
but only
n
rows.
Even if
M
is not represented explicitly, but in some compressed form suitable for a sparse
matrix (e.g., by the locations of its 1's), it is normal for the signature matrix to be much
smaller than
M
.
3.3.5
Computing Minhash Signatures
It is not feasible to permute a large characteristic matrix explicitly. Even picking a random
permutation of millions or billions of rows is time-consuming, and the necessary sorting of
the rows would take even more time. Thus, permuted matrices like that suggested by
Fig.
Fortunately, it is possible to simulate the effect of a random permutation by a random
hash function that maps row numbers to as many buckets as there are rows. A hash function
that maps integers 0
,
1
, . . . , k
− 1 to bucket numbers 0 through
k
− 1 typically will map
some pairs of integers to the same bucket and leave other buckets unfilled. However, the
difference is unimportant as long as
k
is large and there are not too many collisions. We
can maintain the fiction that our hash function
h
“permutes” row
r
to position
h
(
r
) in the
permuted order.
Thus, instead of picking
n
random permutations of rows, we pick
n
randomly chosen
hash functions
h
1
, h
2
, . . . , h
n
on the rows. We construct the signature matrix by considering
each row in their given order. Let SIG(
i, c
) be the element of the signature matrix for the
i
th hash function and column
c
. Initially, set SIG(
i, c
) to ∞ for all
i
and
c
. We handle row
r
by doing the following:
(1) Compute
h
1
(
r
)
, h
2
(
r
)
, . . . , h
n
(
r
).
(2) For each column
c
do the following:
(a) If
c
has 0 in row
r
, do nothing.
(b) However, if
c
has 1 in row
r
, then for each
i
= 1
,
2
, . . . , n
set SIG(
i, c
) to the
smaller of the current value of SIG(
i, c
) and
h
i
(
r
).
with some additional data as
Fig. 3.4
. We have replaced the letters naming the rows by in-
tegers 0 through 4. We have also chosen two hash functions:
h
1
(
x
) =
x
+ 1 mod 5 and
h
2
(
x
)
= 3
x
+ 1 mod 5. The values of these two functions applied to the row numbers are given in