Observation 6.3. The answers that are potentially more similar to the query have L₁-norm equal to ∥v∥₁. Strings with ∥s∥₁ = ∥v∥₁ potentially yield the maximum similarity, which is equal to one. (Notice that we refer to potential similarity, given that ∥s∥₁ = ∥v∥₁ does not necessarily imply that s = v, since s and v might differ in a number of tokens that happen to have equal weights, resulting in equal L₁-norms.) Intuitively, the potential similarity of candidate strings decreases as the L₁-norm of strings diverges from that of the query in either direction.
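To make the caveat concrete, here is a minimal sketch (the tokens and weights are illustrative, not from the text) of two distinct strings whose weighted token sets nevertheless have equal L₁-norms:

    # Hypothetical idf-style token weights; any weighting scheme with ties works.
    weights = {"data": 0.4, "base": 0.4, "index": 0.7, "query": 0.7}

    def l1_norm(tokens):
        # L1-norm of a string's weighted token set: the sum of its token weights.
        return sum(weights[t] for t in tokens)

    s = {"data", "index"}   # ||s||_1 = 0.4 + 0.7 = 1.1
    v = {"base", "query"}   # ||v||_1 = 0.4 + 0.7 = 1.1

    assert l1_norm(s) == l1_norm(v)   # equal norms ...
    assert s != v                     # ... yet s and v are different strings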
Based on Observation 6.3, the most promising candidates are the strings with L₁-norms clustered around the L₁-norm of the query. We can exploit this intuition in order to find a better k-th similarity approximation faster. In essence, instead of starting the round robin iterations of the NRA algorithm from the beginning of the lists, we start from the first string s within each list such that ∥s∥₁ = ∥v∥₁. Then, we perform round robin iterations in both directions, once towards the end of the lists and once towards the beginning. When the algorithm has identified k strings, the k-th smallest partial similarity becomes the query threshold θ, and the algorithm can revert to normal all-match selection processing by scanning lists within the L₁-norm intervals given by Lemma 6.1. Notice that the algorithm is a heuristic: there is no guarantee regarding the quality of the k-th partial similarity discovered using this approach. In the best case the algorithm might immediately identify k strings equal to the query and stop. In the worst case the algorithm might be worse than vanilla NRA (e.g., if it happens that the k-th most similar string is either the first or the last element in all lists).
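The seeding step can be sketched as follows. This is a schematic sketch, not the algorithm's actual implementation: each inverted list is assumed to be a sequence of (l1_norm, string_id) pairs sorted by L₁-norm, and partial_sim is a stand-in for NRA's partial-score bookkeeping over the entries seen so far.

    import bisect

    def initial_threshold(lists, query_norm, k, partial_sim):
        # Seed a cursor pair per list at the first entry with norm >= ||v||_1,
        # then expand outward in both directions, round robin across lists.
        cursors = []
        for lst in lists:
            pos = bisect.bisect_left([norm for norm, _ in lst], query_norm)
            cursors.append([pos, pos - 1])   # [towards end, towards beginning]

        seen = set()
        while len(seen) < k:
            advanced = False
            for lst, cur in zip(lists, cursors):
                if cur[0] < len(lst):        # one step towards the end
                    seen.add(lst[cur[0]][1])
                    cur[0] += 1
                    advanced = True
                if cur[1] >= 0:              # one step towards the beginning
                    seen.add(lst[cur[1]][1])
                    cur[1] -= 1
                    advanced = True
            if not advanced:                 # all lists exhausted early
                break

        # The k-th partial similarity among the seen strings becomes theta.
        scores = sorted((partial_sim(s) for s in seen), reverse=True)
        return scores[k - 1] if len(scores) >= k else 0.0

The bidirectional expansion reflects the observation directly: entries whose L₁-norms lie closest to ∥v∥₁, on either side, are the most promising and are examined first.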
Another heuristic is to use an optimistic approach that assumes that there are at least k data strings with similarity equal to one. Then, we can run an all-match selection query using θ = 1. If the result contains fewer than k answers, we reduce θ deterministically and restart the previous query from where it stopped. We continue until k answers have been produced. Once again, depending on the data and the particular query, this algorithm can either produce all answers within one iteration or require several iterations with progressively smaller thresholds.
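A minimal sketch of this optimistic loop, assuming a hypothetical run_all_match(theta) that resumes the list scans from where the previous invocation stopped and returns the answers accumulated so far, sorted by similarity. The fixed reduction step below is one arbitrary choice; the text only requires that θ be lowered deterministically.

    def optimistic_top_k(run_all_match, k, step=0.1):
        theta = 1.0                          # optimistic start: exact matches only
        answers = run_all_match(theta)
        while len(answers) < k and theta > 0.0:
            theta = max(theta - step, 0.0)   # deterministic threshold reduction
            answers = run_all_match(theta)   # resume; do not recompute from scratch
        return answers[:k]

The trade-off mirrors the previous heuristic: a query with many near-exact matches finishes in the first iteration, while a query whose k-th answer has low similarity pays for several restarts.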