Database Reference
In-Depth Information
weights, it has larger size on average than the prefix signature, but it
results in tighter pruning. The filter tree has proved to be very effective
for answering selection queries, but its performance depends heavily on
the order in which filters are composed. It is dicult to compare fil-
tering techniques with algorithms based on full inverted indexes. The
relative performance depends on whether the indexes can fit in main
memory and also on the query and data characteristics. There exists
an inherent tradeoff between using the index to compute the exact sim-
ilarity between the query and data strings, over using the index as a
filtering step that requires a subsequent verification step.
7.6 Related Work
A first instantiation of the prefix principle was proposed by Sarawagi
and Kirpal [60]. The prefix filter was later formalized by Chaudhuri
et al. [19]. Xiao et al. [75] proposed improved pruning conditions for
Jaccard, Dice, and Cosine similarity for the prefix filter, assuming unit
token weights. The extensions to arbitrary weights discussed here are
straightforward. Chang and Lawler [17] used the idea of partitioning
the strings using the pigeonhole principle for computing both Hamming
distance and edit distance. The same partitioning strategy was used by
Wu and Manber [72] for the agrep utility. The prefix enumeration and
partenum signatures were proposed by Arasu et al. [6].
Bayardo et al. [10] proposed various join algorithms and optimiza-
tions related to building and indexing prefix signatures, and using
advanced list merging algorithms to evaluate join queries, similar to
the ones described in Section 6.3. Xiao et al. [75] extended the algo-
rithms by Bayardo et al. to access parts of the sux of strings as
well in order to improve filtering eciency, assuming that datasets fit
in main memory. Top- k join queries were extensively studied by Xiao
et al. [74], assuming that the dataset fits in main memory. In addition,
Xiao et al. [74] discussed various optimizations specifically tailored for
unit weights. The filter tree was proposed by Li et al. [48].
Search WWH ::




Custom Search