Database Reference
In-Depth Information
and hashes them to some large number of buckets. We can use the same hash function for
all the bands, but we use a separate bucket array for each band, so columns with the same
vector in different bands will not hash to the same bucket.
EXAMPLE 3.10 Figure 3.6 shows part of a signature matrix of 12 rows divided into four
bands of three rows each. The second and fourth of the explicitly shown columns each have
the column vector [0 , 2 , 1] in the first band, so they will definitely hash to the same buck-
et in the hashing for the first band. Thus, regardless of what those columns look like in
the other three bands, this pair of columns will be a candidate pair. It is possible that other
columns, such as the first two shown explicitly, will also hash to the same bucket accord-
ing to the hashing of the first band. However, since their column vectors are different, [1 ,
3 , 0] and [0 , 2 , 1], and there are many buckets for each hashing, we expect the chances of
an accidental collision to be very small. We shall normally assume that two vectors hash to
the same bucket if and only if they are identical.
Figure 3.6 Dividing a signature matrix into four bands of three rows per band
Two columns that do not agree in band 1 have three other chances to become a candidate
pair; they might be identical in any one of these other bands. However, observe that the
more similar two columns are, the more likely it is that they will be identical in some band.
Thus, intuitively the banding strategy makes similar columns much more likely to be can-
didate pairs than dissimilar pairs.
3.4.2
Analysis of the Banding Technique
Suppose we use b bands of r rows each, and suppose that a particular pair of documents
have Jaccard similarity s . Recall from Section 3.3.3 that the probability the minhash sig-
natures for these documents agree in any one particular row of the signature matrix is s .
We can calculate the probability that these documents (or rather their signatures) become a
candidate pair as follows:
(1) The probability that the signatures agree in all rows of one particular band is s r .
Search WWH ::




Custom Search