Database Reference
In-Depth Information
Lemma 8.6. Given a query string v and a string interval [ s i ,s j ], the
edit distance between v and any string s
[ s i ,s j ] is lower bounded by
the number of mismatches from v to LCP ( s i ,s j ).
To get a tighter lower bound on the edit distance estimation we
can also reverse the process by counting the mismatched positional q -
grams from LCP ( s i ,s j )to v . After counting the mismatches on both
directions, the larger one will be returned as the lower bound value. The
algorithm takes linear time with respect to the string lengths, since all
positional q -grams are sorted on positions.
On the property of pairwise lower bounding, similar techniques can
be applied by constructing the longest common prefix for both string
intervals [ s l ,s u ] and [ r l ,r u ]. Again, the number of mismatches from both
positional q -gram sets is counted, the larger of which is the estimated
lower bound.
8.4 Discussion and Related Issues
The state-of-the-art filtering algorithm for edit distance computation
is the mismatch filter. It can prune strings both based on the loca-
tions of matching q -grams and mismatching q -grams, which leads to
tighter pruning than all other approaches. Filtering algorithms cannot
be adapted to evaluate either weighted or normalized edit distance. For
selection queries and datasets that fit in maim memory, the trie based
algorithm is a very ecient approach, especially for computing edit
distance when query strings are given incrementally, as is the case with
interactive applications, since they reuse computation performed in pre-
vious steps. Trie based algorithms can be used to evaluate weighted edit
distance. The B-tree algorithms are the best alternative when consid-
ering incremental updates. They also have the benefit that the same
B-tree structure can be used to answer all types of queries (selections,
joins, all-match, and top- k queries), without the need to specify a min-
imum query threshold a priori. They are also the most ecient for long
strings and large edit distance thresholds. The performance of all other
algorithms deteriorates significantly for large thresholds. Finally, the
 
Search WWH ::




Custom Search