Database Reference
In-Depth Information
threshold θ needs to be decided in advance in order to build the signa-
tures. Since the k -th Hamming distance between any data string and
a given query in the worst case could be
, partitioning and enumer-
ation signatures would degenerate to keeping the entire vectors.
|
Λ
|
7.2.6
All-Match Join and Self-join Queries
All signatures can be used to answer join and self-join queries, with
algorithms similar to those discussed for the prefix filter in Sections 6.3
and 6.4, by replacing the full inverted index with a signature based
inverted index. Once again, signatures for similarity functions other
than Hamming have to be constructed using a best case scenario, by uti-
lizing the minimum and maximum possible query string lengths defined
by the corresponding length filters, as discussed in Section 7.2.4.
k
7.2.7
Top-
Join and Self-join Queries
Once again top- k join and self-join queries can be answered by using
the all-match join and self-join algorithms with signature based inverted
indexes instead of full inverted indexes. The algorithms simply identify
any k candidates during initialization, and use the k -th largest Ham-
ming distance as the initial threshold θ . Notice that in the worst case
the initial k -th distance can be
, and the signature based inverted
index will degenerate to a full inverted index.
|
Λ
|
7.3 The Filter Tree
When building a full inverted index on the data we can apply various
filtering techniques to reduce the number of candidate strings exam-
ined. That is, we are able to apply various filtering criteria based on
the specific properties of each similarity function (e.g., norm filtering)
to prune strings while scanning a particular inverted list.
We can generalize this idea by applying the various filtering algo-
rithms, using a filter tree. A filter tree is a combination of trees. For
example, in the first level of the tree we partition strings according
to their L p -norms. In the second level of the tree we could partition
strings according to the tokens contained in their idf sorted prefixes. In
Search WWH ::




Custom Search