Algorithms for Edit Based Similarity - Approximate String Processing

Database Reference

In-Depth Information

Algorithm 8.3.4: TopKQuery( v,N,θ,s l ,s u ,H )

if N is the leaf node

for each s

∈

N

if Edit Distance Verification ( v, s, θ )

do

Insert s into H

if

do

>k : H. pop

Update the global threshold θ

|

H

|

else

if LB ( v, [ s l ,s 1 ])

≤

θ

do TopKQuery ( v,N 1 ,θ,s l ,s 1 ,H )

for j =2 to m

do if LB ( v, [ s j− 1 ,s j ])

do

≤

θ

do TopKQuery ( v,N j ,θ,s j− 1 ,s j ,H )

if LB ( v, [ s m ,s u ])

≤

θ

do TopKQuery ( v,N m +1 ,s m ,s u ,H )

We use LB ([ s l ,s u ] , [ r l ,r u ]) to denote the lower bound edit distance

between the two string intervals. This property allows the direct adop-

tion of the standard join algorithm, as is shown in Algorithm 8.3.5. The

algorithm recursively expands the nodes in depth-first order, to give a

clear idea on the joining process. In practice we can use a heap to store

the node pairs and pop out the next candidate to join if it satisfies the

minimal distance lower bound.

Finally, this B-tree index scheme is also capable of handling nor-

malized edit distance. This is achievable if the maximal length of the

strings within an interval can be estimated by the string order:

Property 8.4(Length Bounding). Given any string interval [ s i ,s j ],

the string order φ is length bounding if it e ciently returns an upper

bound on the length of any string s

∈

[ s i ,s j ].

The length bounding property can be combined with any of the

query processing algorithms, by dividing the lower bound edit distance

with the maximal length of the string intervals. Therefore, this index

Approximate String Processing

Search WWH ::

Custom Search

Home