Database Reference
In-Depth Information
7.2.3
The Prefix Enumeration Signature
Extending the pigeonhole and partenum signatures for arbitrary
weights is non-trivial. A naive approach would be to represent each
token λ i of string s using W ( λ i ) copies (appropriate rounding tech-
niques can be used for real valued weights), and essentially converting
a weighted set of tokens into an unweighted bag and adversely increas-
ing the vector dimensionality. The pigeonhole signature cannot be used
for arbitrary weights in practice.
A different approach is to use ideas both from the pigeonhole and
prefix signatures to produce a weighted signature for the string. Assum-
ing Weighted Intersection similarity and query threshold θ , the idea is
to enumerate all minimal subsets of tokens of string s whose total
weight adds up to θ , and then hash each one of these subsets into an
integer using a hash function h (a minimal subset is a set that has no
proper subset with total weight larger than θ ). The signature of the
string is the set of resulting hash values. Clearly, if I ( s, r ) ≥ θ then s
and r have to have at least one of the enumerated subsets of tokens in
common, resulting in a non-empty signature intersection.
The obvious drawback of this signature scheme is that for very
long strings the number of minimal token subsets exceeding the query
threshold will tend to be very large (especially for small thresholds),
resulting in a very large signature. To decrease the size of the signature
we can order every subset of tokens in decreasing weight order and hash
only a prefix of each subset (e.g., we hash all minimal prefixes with total
weight ≥ θ , where θ controls the size of the signature). It is expected
that a large number of minimal subsets will share the same minimal
prefix resulting in exactly the same hash value, and hence reducing the
size of the signature. The drawback of course is that the new signature
results in false positives. Also, the resulting signatures do not have a
fixed size; the size heavily depends on the contents of a string. We call
this signature the prefix enumeration signature.
7.2.4
All-Match Selection Queries
For evaluating Hamming distance, we can use an inverted index to
answer queries using either the pigeonhole or the prefix enumeration
Search WWH ::




Custom Search