Database Reference
In-Depth Information
Notice that for weighted intersection the L 1 -norm does not play any
role in the similarity of two strings, hence we cannot design an L 1 -
norm filter to prune strings. Nevertheless, assume that we use the nra
algorithm to merge token lists. Weighted intersection is a monotone
aggregate function with partial weights α i ( s, v )= W ( λ i ). Notice that
in this case the partial weights of all strings in a given token list are con-
stant (i.e., α i ( s, v )= α i ( s ,v ) ,
s, s
L ( λ i )). An immediate conclusion
f = i =1 W ( λ i ) of the nra algo-
rithm will never be true (for 0
I
is that the terminating condition
i =1 W ( λ i )). Hence, given that
neither an L 1 -norm filter nor a termination condition can be applied,
the nra algorithm will have to exhaustively scan all token lists.
6.1.3
Prioritization Across Lists
This section focuses on strategies that take advantage of specific access
priorities across inverted lists. Notice that for the multiway merge and
threshold based algorithms the order in which inverted lists are pro-
cessed is not important (e.g., the order of the round robin iterations
over the inverted lists for the nra algorithm). Nevertheless, the distribu-
tion of tokens in the data strings can help determine a more meaningful
ordering in which inverted lists are processed, that can yield significant
performance benefits.
Let the lists in L v be sorted according to their respective token
weights, from heaviest to lightest. Without loss of generality let this
ordering be L v =
W ( λ v m ). Con-
sidering the partial token weights of a given similarity function, since
α 1 ( s, v )
L ( λ 1 ) ,...,L ( λ v m )
s.t. W ( λ 1 )
{
}
≥ ··· ≥
α m ( s, v ), the most promising candidates appear in list
L ( λ 1 ); the second most promising appear in L ( λ 2 ); and so on. This
leads to an alternative sequential algorithm, which we refer to here as
heaviest first . The algorithm exhaustively reads the heaviest token
list first, and stores all strings in a candidate set. The second to heaviest
list is scanned next. The similarity of strings that have already been
encountered in the previous list is updated, and new candidates are
added in the candidate set, until all lists have been scanned and the
similarities of all candidates have been computed. While a new list is
being traversed the algorithm prunes the candidates in the candidate
≥ ··· ≥
Search WWH ::




Custom Search