Database Reference
In-Depth Information
string domain to an integer space where strings are sorted in dictionary
order. We denote this sorted dictionary order with φ d .
It is obvious that dictionary order follows the property of compa-
rability. Given two strings s and r , it is sucient to find the most
significant position p where the two strings differ. If π ( s [ p ]) ( r [ p ]),
we can assert that s precedes r in dictionary order, and vice versa. This
comparison can be done in linear time with respect to the length of the
strings — we do not need to actually instantiate order φ d .
Dictionary order is also consistent with the property of lower bound-
ing. Given a string interval [ φ d ( s i ) d ( s j )] (or for simplicity [ s i ,s j ]) in
dictionary order, we know that all strings in this interval must share
the longest common prefix of s i and s j , i.e., LCP ( s i ,s j ). To be more
precise, if s
[ s i ,s j ], we have:
p
[1 ,
|
LCP ( s i ,s j )
|
] , [ p ]= s i [ p ]= s j [ p ] .
(8.1)
. In fact, we can actually use letter s [ p +1]to
refine the lower bound even further:
Let p =
|
LCP ( s i ,s j )
|
s i [ p +1] ≤ s [ p +1] ≤ s j [ p +1] ,p = |LCP ( s i ,s j ) |.
(8.2)
For example, consider the string interval ['food', 'found']. Any string
within this interval must have the prefix 'fo' and the third character
must be in the interval ['o', 'u']. The sux after the third character can
be any valid string in the alphabet (with arbitrary length). Notice that
this does not imply that all of these strings with unknown arbitrary
suxes are actually contained in interval ['food', 'found']. We simply
return a super-set of the strings in the interval which is sucient for
computing a correct (albeit) loose lower bound.
Equation
(8.2)
is
valid
only
if
|
s i
|
>
|
LCP ( s i ,s j )
|
. f
|
s i
|
=
|
(i.e., s i is completely covered by s j ), then no further
refinement of the lower bound is possible.
Given Equations (8.1) and (8.2), we can now derive an ecient
lower bound computation between a query string v and any string s
LCP ( s i ,s j )
|
[ s i ,s j ], based on the edit distance verification Algorithm 2.1.1. Table 8.3
shows a running example of the computation of the lower bound on the
edit distance between query 'fund' and interval ['food', 'found'], with
threshold θ = 1. Notice that the third row of the table uses a candidate
Search WWH ::




Custom Search