Databases Reference
In-Depth Information
items are. We then consider any pair that hashed to the same bucket for any
of the hashings to be a candidate pair. We check only the candidate pairs for
similarity. The hope is that most of the dissimilar pairs will never hash to the
same bucket, and therefore will never be checked. Those dissimilar pairs that
do hash to the same bucket are false positives; we hope these will be only a
small fraction of all pairs.. We also hope that most of the truly similar pairs
will hash to the same bucket under at least one of the hash functions. Those
that do not are false negatives; we hope these will be only a small fraction of
the truly similar pairs.
If we have minhash signatures for the items, an effective way to choose the
hashings is to divide the signature matrix into b bands consisting of r rows
each. For each band, there is a hash function that takes vectors of r integers
(the portion of one column within that band) and hashes them to some large
number of buckets. We can use the same hash function for all the bands, but
we use a separate bucket array for each band, so columns with the same vector
in different bands will not hash to the same bucket.
1 0 0 0 2
3 2 1 2 2
0 1 3 1 1
. . .
. . .
band 1
band 2
band 3
band 4
Figure 3.6: Dividing a signature matrix into four bands of three rows per band
Example 3.10 : Figure 3.6 shows part of a signature matrix of 12 rows divided
into four bands of three rows each. The second and fourth of the explicitly
shown columns each have the column vector [0, 2, 1] in the first band, so they
will definitely hash to the same bucket in the hashing for the first band. Thus,
regardless of what those columns look like in the other three bands, this pair
of columns will be a candidate pair. It is possible that other columns, such as
the first two shown explicitly, will also hash to the same bucket according to
the hashing of the first band. However, since their column vectors are different,
[1, 3, 0] and [0, 2, 1], and there are many buckets for each hashing, we expect the
chances of an accidental collision to be very small. We shall normally assume
that two vectors hash to the same bucket if and only if they are identical.
Two columns that do not agree in band 1 have three other chances to become
a candidate pair; they might be identical in any one of these other bands.
Search WWH ::




Custom Search