answers as efficiently as possible. Clearly, given a query, it is impossible to know a priori what the k-th similarity will be. Hence, we can only resort to heuristics. A simple idea is to start merging lists until the similarity of k elements has been computed. Then, we populate a heap using those k candidates and use the k-th similarity as the query threshold. The algorithm proceeds by maintaining the top k elements in the heap as the similarity of new candidates is computed. The current, always increasing, k-th similarity can be used by the optimized multiway merge algorithm as a threshold for skipping elements from the lists and terminating early.
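As a sketch, the heap maintenance described above might look like the following Python fragment (the similarity function and candidate stream are placeholders for whatever the query actually uses, not part of the original pseudocode):

```python
import heapq

def top_k_merge(candidates, similarity, k):
    """Maintain the k largest similarities seen so far in a min-heap.

    `candidates` yields strings in the order the list merge produces
    them; `similarity` is a placeholder for the query's similarity
    function. The heap root is the current k-th similarity, a
    monotonically non-decreasing threshold that the optimized multiway
    merge can use to skip list entries and terminate early.
    """
    heap = []  # min-heap of (score, string); root holds the k-th best score
    for s in candidates:
        score = similarity(s)
        if len(heap) < k:
            heapq.heappush(heap, (score, s))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, s))
        # once the heap holds k entries, heap[0][0] is the query threshold
    return sorted(heap, reverse=True)
```

Using string length as a stand-in similarity, `top_k_merge(["a", "abc", "ab", "abcd"], len, 2)` keeps only the two longest strings, and the threshold rises from 1 to 3 as candidates stream in.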
6.2.2 Lists Sorted in L_p-norm Order
As we argued for all-match selection queries, Weighted Intersection similarity cannot benefit from L_p-norm ordering, since the function itself does not depend on any string norm.
Consider Normalized Weighted Intersection. Let token lists be sorted in increasing order of L_1-norms. We modify the nra algorithm (see Algorithm 6.1.2) to maintain a heap H containing the k strings with the largest partial similarity computed so far from all strings appearing in candidate set M. The partial similarity of a string is clearly a lower bound on the true similarity (not to be confused with the best-case similarity computed using Lemma 6.4, which is an upper bound on the similarity). Recall that nra maintains a best-case similarity score f^N for a conceptual frontier string. Clearly, f^N is an upper bound on the similarity of any string that has not been encountered in any of the lists yet. Let N_k be the k-th smallest similarity score in heap H. Straightforwardly, when f^N < N_k there are no strings s ∉ M s.t. N(s, v) ≥ N_k, hence the algorithm can stop inserting new strings in M. The algorithm can also evict from M strings whose best-case similarity is smaller than N_k, as both N_k and the best-case similarities of strings in M keep getting tighter. After the frontier condition has been met, the algorithm simply needs to complete the partial similarity scores of strings already in M in order to determine the final top-k answers.
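A simplified sketch of this bookkeeping, assuming Normalized Weighted Intersection of the form N(s, v) = W(s ∩ v) / max(||s||_1, ||v||_1), where W sums token weights; the best-case bound below optimistically credits all not-yet-scanned query token weight and is an illustrative stand-in for Lemma 6.4, not the lemma itself:

```python
import heapq

def nwi_bounds(partial_weight, remaining_weight, norm_s, norm_v):
    """Lower and upper bounds on N(s, v) for a candidate in M.

    partial_weight: common token weight accumulated from the lists
    scanned so far (gives the lower-bound, partial similarity).
    remaining_weight: total weight of query tokens whose lists have
    not yet been scanned for this string (credited in full here; an
    illustrative stand-in for the Lemma 6.4 best-case bound).
    """
    denom = max(norm_s, norm_v)
    lower = partial_weight / denom
    upper = min(partial_weight + remaining_weight, denom) / denom
    return lower, upper

def frontier_prune(M, H, k, f_N):
    """Apply the stopping and eviction rules from the text.

    H is a min-heap of the k largest partial similarities, so H[0]
    is N_k. M maps candidate strings to their best-case (upper-bound)
    scores; f_N is the best-case score of the conceptual frontier
    string. Returns (stop_inserting, surviving candidates).
    """
    N_k = H[0] if len(H) >= k else float("-inf")
    stop_inserting = f_N < N_k  # no unseen string can reach N_k
    survivors = {s: ub for s, ub in M.items() if ub >= N_k}
    return stop_inserting, survivors
```

Once `stop_inserting` holds, no new string enters M, and any candidate whose upper bound falls below N_k can never enter the answer; only the surviving entries of M need their partial scores completed.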
The important question is how to seed the algorithm with a good set of k initial candidates. The following observation is essential.