Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

Notice that for weighted intersection the L 1 -norm does not play any

role in the similarity of two strings, hence we cannot design an L 1 -

norm filter to prune strings. Nevertheless, assume that we use the nra

algorithm to merge token lists. Weighted intersection is a monotone

aggregate function with partial weights α i ( s, v )= W ( λ i ). Notice that

in this case the partial weights of all strings in a given token list are con-

stant (i.e., α i ( s, v )= α i ( s ,v ) ,

s, s ∈

L ( λ i )). An immediate conclusion

∀

f = i =1 W ( λ i ) <θ of the nra algo-

rithm will never be true (for 0 <θ

I

is that the terminating condition

≤ i =1 W ( λ i )). Hence, given that

neither an L 1 -norm filter nor a termination condition can be applied,

the nra algorithm will have to exhaustively scan all token lists.

6.1.3

Prioritization Across Lists

This section focuses on strategies that take advantage of specific access

priorities across inverted lists. Notice that for the multiway merge and

threshold based algorithms the order in which inverted lists are pro-

cessed is not important (e.g., the order of the round robin iterations

over the inverted lists for the nra algorithm). Nevertheless, the distribu-

tion of tokens in the data strings can help determine a more meaningful

ordering in which inverted lists are processed, that can yield significant

performance benefits.

Let the lists in L v be sorted according to their respective token

weights, from heaviest to lightest. Without loss of generality let this

ordering be L v =

W ( λ v m ). Con-

sidering the partial token weights of a given similarity function, since

α 1 ( s, v )

L ( λ 1 ) ,...,L ( λ v m )

s.t. W ( λ 1 )

{

}

≥ ··· ≥

α m ( s, v ), the most promising candidates appear in list

L ( λ 1 ); the second most promising appear in L ( λ 2 ); and so on. This

leads to an alternative sequential algorithm, which we refer to here as

heaviest first . The algorithm exhaustively reads the heaviest token

list first, and stores all strings in a candidate set. The second to heaviest

list is scanned next. The similarity of strings that have already been

encountered in the previous list is updated, and new candidates are

added in the candidate set, until all lists have been scanned and the

similarities of all candidates have been computed. While a new list is

being traversed the algorithm prunes the candidates in the candidate

≥ ··· ≥

Approximate String Processing

Search WWH ::

Custom Search

Home