Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

in increasing (or decreasing) order of string identifiers. Moreover, the

algorithm has to exhaustively scan all lists in L v in order to iden-

tify all strings with similarity exceeding threshold θ . Merging of sorted

lists has been studied extensively with respect to external sorting algo-

rithms (where the goal is to merge multiple sorted runs of files), and the

algorithms used for our purposes are exactly the same. The multiway

merge procedure for sequences is shown in Algorithm 6.1.1. The algo-

rithm assumes that inverted lists are constructed for sequences and that

each list is sorted primarily by token positions and secondarily by string

identifiers. With this simple optimization the algorithm can directly

seek to the first element in each list with token position equal to the

token position in the query, and stop reading from a particular list once

all elements with token position equal to that of the query have been

Algorithm 6.1.1: Multiway Merge( v,θ )

( λ 1 , 1) ,..., ( λ v m ,m )

Tokenize the query: v =

{

}

M is a heap of ( s,

I s ,i ) tuples, sorted on s

S is a stack of indexes i

M

←∅

,S

←{

1 ,...,m

}

for i

1 to m

do Seek to first element of L ( λ i ) with token position p = i

while M is not empty or S is not empty

←

while S is not empty

i = S. pop

( s, p )= L ( λ i ) . next

if p>i then continue

if s

do

∈

M

then ( s,

do

M. find( s )

I s += W ( λ i ) and S

I

s ,j )

←

i

( s, W ( λ i ) ,i )

else M

←

( r,

M. pop

if I r ≥ θ then report ( r,I r )

S

I

r ,j )

←

j

Approximate String Processing

Search WWH ::

Custom Search

Home