Database Reference
In-Depth Information
mismatch prefix . If two strings have an empty mismatch prefix inter-
section, then they contain enough mismatching q -grams for the edit
distance to be larger than θ and the pair can be pruned using the
shorter prefix.
We can identify the mismatch prefix by iteratively computing the
minimum destroy operations of all prefixes with lengths in the range
[ θ +1 ,qθ + 1]. We stop once the minimum destroy operations of the
current prefix is larger than θ . The cost of this algorithm is quadratic
in + 1. To reduce this cost we can identify the appropriate prefix
using binary search, based on the monotonicity of the minimum destroy
operations.
Lemma 8.5. Given a set of q -grams Q and any subset Q
Q
Minimum Destroy Operations( Q )
Minimum Destroy Operations( Q ) .
The algorithm is shown as Algorithm 8.2.2.
Algorithm 8.2.2: Mismatch Prefix( s )
Tokenize string: s = { ( λ 1 , 1) ,..., ( λ s m ,m ) } s.t. idf ( λ i ) ≥ idf ( λ j ) ,i<j
a
θ +1 ,c
+1
while a<c
b
( a + c ) / 2
( λ 1 ,p 1 ) ,..., ( λ b ,p b )
e
Minimum Destroy Operations(
{
}
)
do
if e
θ
then a
b +1
else c
b
return a
8.3 Trees
8.3.1
Tries and Sux Trees
The simplest way of using a trie to find strings within small edit
distance θ from the query is to generate all possible strings within
 
Search WWH ::




Custom Search