examined. This optimization is irrelevant when indexing frequency-sets
or sets. In addition, one could allow token positions to vary within some
bounds of the corresponding token position in the query, in order to
evaluate sequence based similarity measures that allow tokens to have
slightly different positions in the query and the resulting data strings
(e.g., recall the example 'The Bill & Melinda Gates Foundations' and
'The Melinda & Bill Gates Foundation'). An important detail that is omitted from Algorithm 6.1.1 for simplicity is that, whenever a given token λ appears multiple times in different positions, the algorithm should scan list L(λ) only once, simultaneously for all positions. We omit this detail in all subsequent algorithms as well.
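The duplicate-token handling described above can be sketched as follows. Since Algorithm 6.1.1 is not reproduced here, names such as `lists` (token → sorted list of string identifiers, i.e., set semantics) and `W` (token weights) are illustrative; the point is that a token occurring q times in the query is charged q times per matching string while its list is scanned only once:

```python
import heapq
from collections import Counter

def weighted_intersection(query_tokens, lists, W, theta):
    """Multiway merge over the inverted lists of the query's distinct
    tokens (set semantics: lists hold sorted string identifiers).
    A token occurring q times in the query contributes q * W[tok] per
    matching string, so each list L(tok) is scanned only once."""
    freq = Counter(query_tokens)                 # query-side multiplicities
    iters = {tok: iter(lists[tok]) for tok in freq if tok in lists}
    heap = []                                    # entries: (string_id, token)
    for tok, it in iters.items():
        sid = next(it, None)
        if sid is not None:
            heapq.heappush(heap, (sid, tok))
    results = []
    while heap:
        sid = heap[0][0]
        score = 0.0
        while heap and heap[0][0] == sid:        # drain all lists holding sid
            _, tok = heapq.heappop(heap)
            score += freq[tok] * W[tok]          # all duplicate positions at once
            nxt = next(iters[tok], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, tok))
        if score >= theta:
            results.append((sid, score))
    return results
```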
Computing Normalized Weighted Intersection is also straightforward,
provided that the L1-norm of every data string is stored in the token lists (see Definition 2.6). Hence, the lists store tuples string-identifier/token-position/L1-norm per string for sequences, string-identifier/token-frequency/L1-norm for frequency-sets, and string-identifier/L1-norm for sets. Once again, if lists are already sorted in increasing order of string identifiers, a multiway merge algorithm can evaluate the normalized weighted intersection between the query and all relevant strings very efficiently. The algorithm for computing normalized weighted intersection
essentially boils down to computing Weighted Intersection and hence is
very similar to Algorithm 6.1.1. Similar arguments hold for Jaccard, Dice
and Cosine similarity.
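A sketch of the merge extended with the stored L1-norms might look as follows. Normalizing by the maximum of the two norms is an assumption made here for concreteness (the exact form is fixed by Definition 2.6, which is not reproduced); only that single line would change for a different normalization:

```python
import heapq
from collections import Counter

def normalized_intersection(query_tokens, lists, W, theta):
    """lists[tok] holds (string_id, string_l1_norm) pairs sorted by
    string id, mirroring the tuples stored per token list for sets.
    Normalization by max of the two L1-norms is an assumed choice."""
    freq = Counter(query_tokens)
    q_norm = sum(f * W[t] for t, f in freq.items())   # ||q||_1
    iters = {t: iter(lists[t]) for t in freq if t in lists}
    heap = []                                         # (sid, token, s_norm)
    for t, it in iters.items():
        entry = next(it, None)
        if entry is not None:
            heapq.heappush(heap, (entry[0], t, entry[1]))
    out = []
    while heap:
        sid = heap[0][0]
        inter, s_norm = 0.0, None
        while heap and heap[0][0] == sid:
            _, t, s_norm = heapq.heappop(heap)
            inter += freq[t] * W[t]
            nxt = next(iters[t], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], t, nxt[1]))
        score = inter / max(q_norm, s_norm)           # assumed normalization
        if score >= theta:
            out.append((sid, score))
    return out
```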
A number of optimizations are possible in the case of unit weights (i.e., when ∀λ ∈ Λ, W(λ) = 1). For unit weights, the problem of identifying
strings with similarity exceeding the query threshold θ is equivalent to
the problem of identifying strings that appear in at least θ lists (since
all the lists have the same weight, it does not matter which list a string
appears in). In this case, in every iteration of the multiway merge algorithm, first we pop all occurrences of the top element of the heap. If that element occurs at least θ times we report it. Otherwise, assume that it appears a total of x times. We pop from the heap another θ − x − 1 elements, identify the new top element of the heap, let it be r, and directly skip to the first element s ≥ r in every token list (we can identify such elements using binary search on the respective lists); since only θ − 1 elements were popped in total, any skipped element can appear in at most θ − 1 lists and cannot qualify. Finally, we insert the new top elements from every list for which a skip actually occurred.
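The unit-weight skipping strategy above can be sketched as follows; this is a simplified variant in the spirit of the description (names such as `unit_weight_merge_skip` are illustrative), assuming `lists` holds sorted, duplicate-free string identifiers and `theta` is the integer occurrence threshold:

```python
import heapq
from bisect import bisect_left

def unit_weight_merge_skip(lists, theta):
    """With unit weights a string qualifies iff it appears in at least
    theta of the sorted token lists. Heap entries are (current value,
    list index); pos[i] is the offset of list i's current entry."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:        # pop all occurrences of top
            popped.append(heapq.heappop(heap))
        if len(popped) >= theta:
            results.append(top)
            r = top + 1                          # just advance past top
        else:
            # pop extra elements until theta - 1 in total: any value below
            # the new top then occurs in at most theta - 1 lists, so the
            # skipped region cannot contain a qualifying string
            while heap and len(popped) < theta - 1:
                popped.append(heapq.heappop(heap))
            r = heap[0][0] if heap else top + 1
        for _, i in popped:                      # binary-search skip to >= r
            pos[i] = bisect_left(lists[i], r, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return results
```

Note that a popped element equal to r is re-inserted at its current position, so ties with the new top are not lost during the skip.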