Database Reference
In-Depth Information
signatures. Recall that given two strings s, r and corresponding pigeon-
hole signatures PG x ( s ) ,PG x ( r ),
|PG x ( s )
H ( s, r ) ≤ θ implies that
PG x ( r )
x . Answering this query using an inverted index is reminis-
cent of using the multiway merge , threshold and heaviest first
algorithms to answer intersection queries on strings (with unit token
weights). Indeed, we can use essentially the same algorithms to evaluate
the intersection between pigeonhole signatures using the inverted index.
The same ideas apply for answering queries using the prefix enumera-
tion signature. The resulting inverted index contains one list per hash
value, for all hash values contained in the signatures of the data strings.
Nevertheless, we cannot straightforwardly use any of these signa-
tures to answer all-match selection queries for any of the other similarity
measures, given that the derived upper bounds given in Lemma 7.11,
depend on the lengths of both the data and the query string. Since
for arbitrary queries the lengths of the query strings are not known
in advance, constructing a meaningful signature for each data string a
priori is not trivial. In order to build such signatures, we use the length
filters (i.e., L p -norm filters for non-uniform weights) derived for each
similarity function.
Simply stated, given a particular similarity function (other than
Hamming), a similarity threshold θ , and the associated L p -norm fil-
ter, we know that for a given data string, the only query strings that
can have similarity larger than θ have to fall within the norm interval
specified by the norm filter. Hence, when constructing the signature
of the data string, we can simply derive an upper bound on the tar-
get Hamming distance that will guarantee no false dismissals by using
the minimum and maximum filtering lengths. Clearly, since Weighted
Intersection is not based on string norms, we cannot use these signa-
tures to evaluate Weighted Intersection similarity.
In particular, the following upper bounds are easily derived.
|≥
Lemma 7.12. Assuming a uniform weight function W ( λ )= c, λ
Λ,
given two strings s, r
Normalized Weighted Intersection:
1
θ
2 θ
N
( s, r )
θ
⇒H
( s, r )
+1
|
r
|
.
 
Search WWH ::




Custom Search