Database Reference
In-Depth Information
is used in order to prune both based on the q -gram hashes and the
positions simultaneously).
The mapping function of gram location order is formally defined as
the dictionary order on the positional q -gram set:
φ gl ( s )= φ d ( s ) .
(8.6)
After transforming every string to the top- L positional q -gram sets,
the property of comparability is guaranteed due to the property of
dictionary order. In the following, we analyze the property of lower
bounding and pairwise lower bounding.
Similar to dictionary order, given a string interval [ s i ,s j ] in gram
location order, all strings in this interval share a common prefix
LCP ( s i ,s j ) of positional q -grams. If a query string v is also represented
by a positional q -gram set v , the lower bound on the edit distance from
v to any string s
[ s i ,s j ] can be estimated by counting the number of
mismatched positional q -grams between v and LCP ( s i ,s j ). For that
purpose we can use the mismatch filter condition from Lemma 8.4.
Definition 8.2(Positional Mismatch). Given edit distance thresh-
old θ and two positional q -gram sets v and s , a positional q -gram
( h i ,p i )
v incurs a mismatch if either holds: 1. For all x ,no( h x ,p x )
s exists s.t. h x = h i and
p x
p i |≤
θ ; 2. No mismatching ( h j ,p j ) has
|
been found already, s.t. p i
p j
θ for any p j <p i .
The first condition checks for potential matches between the same
q -grams in v and s within distance θ . The second condition checks
whether there exists a mismatching q -gram preceding the current q -
gram within distance θ from the position of the current q -gram. If such
a mismatch exists, then it might be possible, in a best case scenario,
to correct both q -grams with one edit operation simultaneously (since
they overlap), and thus we cannot count both q -grams as mismatches.
Note that the definition above assumes that each string is long enough
to contain enough positional q -grams. It is easy to modify the definition
in case that the number of q -grams is not large enough. We can state
the following lemma.
 
Search WWH ::




Custom Search