Database Reference
In-Depth Information
Fig. 7.3 If two vectors differ in all four hash values σ i , then they differ in at least one bit
within each partition, hence their Hamming distance is larger than θ =3.
the concatenated string, such that the partition number information is
encoded in the hash value. In that case, two strings are within Ham-
ming distance θ if and only if they have at least one hash value in
common.
Lemma 7.8. Given strings s, r and their corresponding pigeonhole
signatures PG 1 ( s ) ,PG 1 ( r )
PG 1 ( s )
PG 1 ( r )
H
( s, r )
θ
=
.
Expressing buckets as bit sequences in order to compute the hash value
of each bucket is not practical, since it implies the need to actually
instantiate the sparse vectors corresponding to data strings (depend-
ing on the dimensionality of the vectors and the hash function used,
buckets can have extremely large bit sequence representations). A more
practical alternative is to hash a compact sparse representation of each
bucket. This can be accomplished easily by concatenating and hashing
only the indices of those coordinates that are set to 1.
The filtering effectiveness of signature PG 1 is affected by two fac-
tors. First, it is not unlikely for two strings to agree in a given partition
by pure chance resulting in a false positive. Second, in practice the vec-
tor dimensionality
is very large with respect to the average string
length. This implies that the resulting vectors are very sparse. Hence,
in expectation, a large number of strings will contain empty partitions
(i.e., partitions with bit-vectors consisting only of zeros). Pairs of strings
that share empty partitions need to be reported as candidates, which
might lead to an increased number of false positives. It is tempting to
consider pruning all pairs of strings that agree only on empty parti-
tions. Nevertheless, this would lead to false dismissals, as can be easily
|
Λ
|
 
Search WWH ::




Custom Search