Database Reference
In-Depth Information
in increasing (or decreasing) order of string identifiers. Moreover, the
algorithm has to exhaustively scan all lists in L v in order to iden-
tify all strings with similarity exceeding threshold θ . Merging of sorted
lists has been studied extensively with respect to external sorting algo-
rithms (where the goal is to merge multiple sorted runs of files), and the
algorithms used for our purposes are exactly the same. The multiway
merge procedure for sequences is shown in Algorithm 6.1.1. The algo-
rithm assumes that inverted lists are constructed for sequences and that
each list is sorted primarily by token positions and secondarily by string
identifiers. With this simple optimization the algorithm can directly
seek to the first element in each list with token position equal to the
token position in the query, and stop reading from a particular list once
all elements with token position equal to that of the query have been
Algorithm 6.1.1: Multiway Merge( v,θ )
( λ 1 , 1) ,..., ( λ v m ,m )
Tokenize the query: v =
{
}
M is a heap of ( s,
I s ,i ) tuples, sorted on s
S is a stack of indexes i
M
←∅
,S
←{
1 ,...,m
}
for i
1 to m
do Seek to first element of L ( λ i ) with token position p = i
while M is not empty or S is not empty
while S is not empty
i = S. pop
( s, p )= L ( λ i ) . next
if p>i then continue
if s
do
M
then ( s,
do
M. find( s )
I s += W ( λ i ) and S
I
s ,j )
i
( s, W ( λ i ) ,i )
else M
( r,
M. pop
if I r ≥ θ then report ( r,I r )
S
I
r ,j )
j
 
Search WWH ::




Custom Search