Database Reference
In-Depth Information
s . Furthermore, since all remaining strings r in the event queue are
also processed in decreasing order of J
of
J
π r , no string r will ever need to
L ( λ π ) after the point where the scan has stopped,
hence we can delete from L ( λ π ) all remaining strings to save space.
A drawback of this algorithm is that it might have to compute
the exact similarity between the same pair of strings as many times
as the number of common tokens between those strings. An obvious
optimization is to insert the similarity of each pair into a hash table
the first time it is computed. But this would result in a potentially large
hash table. Instead, a simple improvement is to first compute whether
a pair will be identified a second time, the first time its similarity is
computed, and insert it in the hash table only if necessary. This is
possible, since it is easy to compute an upper bound on the similarity
of the pair and hence a worst case prefix length that will have to be
examined. We simply compute whether the two strings have more than
one token in common within this worst case prefix.
access any string s
7.2 Partitioning and Enumeration Signatures
Prefix filtering is based on defining a total order of tokens in Λ. A funda-
mentally different approach is based on permuting the tokens instead.
After tokens have been randomly permuted, we either partition tokens
and hash each partition to create a signature based on the pigeonhole
principle, or enumerate all possible combinations of a certain number
of tokens, based on the prefix principle, and hash the resulting combi-
nations to create a signature.
7.2.1
The Pigeonhole Signature
Consider a random permutation of tokens in Λ. Such a permutation in
practice can be imposed by using a hash function h
N
(practical
hash functions cannot produce a truly uniform permutation of tokens,
but in reality the permutation imposed by these hash functions is good
enough for applications).
Partitioning signatures are based on the pigeonhole principle. For
simplicity, consider strings as sets, and let a string s be conceptually
represented as a vector of dimensionality Λ, with magnitude W ( λ i )
Search WWH ::




Custom Search