Algorithms for Edit Based Similarity - Approximate String Processing

Database Reference

In-Depth Information

Algorithm 8.3.2: FindNode( v,N )

if N is the leaf node: return N

for each splitting string s j

∈

N

do if φ ( v ) ≤ φ ( s j ): return FindNode ( v,N j )

return FindNode ( v,N m +1 )

searches the subtree from the corresponding pointer all the way to the

leaf level of the tree. The implementation of insertion, deletion, and

split operations follows similar strategies by using the new comparison

oracle between φ ( s ) and φ ( r ).

From Algorithm 8.3.2, we can see that all strings stored in the

sub-tree rooted at N j have mapping values in [ φ ( s j− 1 ) ,φ ( s j )] by

construction. To simplify notation, we say that s

∈

[ s i ,s j ]if φ ( s i )

≤

φ ( s ) ≤ φ ( s j ).

The following property enables the B-tree structure to handle range

queries based on edit distance:

Property 8.2 (Lower Bounding). A string order φ is lower bound-

ing if it eciently returns the minimal edit distance between string v

and any s

∈

[ s i ,s j ].

With Property 8.2 the B-tree can handle range queries using Algo-

rithm 8.3.3. In the algorithm we use LB ( s i , [ s j− 1 ,s j ]) to denote the

lower bound on the edit distance between s i and any string s

∈

[ s j− 1 ,s j ]. Algorithm 8.3.3 iteratively visits the nodes with lower bound

edit distance no larger than θ and verifies the strings found at the leaf

level of the tree using Algorithm 2.1.1. Notice that the algorithm might

have to traverse multiple paths down the tree (as opposed to the stan-

dard B-tree traversal algorithm). The minimal and maximal strings s l

and s u indicate the boundaries of any string in a given subtree with

respect to the string order φ . This information can be retrieved from

the parent node, as the algorithm implies. It is easy to verify that this

algorithm accurately returns all strings within distance θ from query

Search WWH ::

Custom Search

Home