Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

index on string identifiers is available on each of the lists in the sux,

the algorithm needs to perform, on average, one random access per

candidate per remaining list. If an index is not available, then assum-

ing that lists are sorted in increasing norm order, the algorithm keeps

track of the smallest and largest norm in the candidate set, seeks to the

first list element with norm larger equal to the smallest norm and scans

each list as deep as the largest norm. Alternatively, binary search can

be performed. This strategy will help potentially prune a long head

and tail from each list in the sux. Alternatively, if the size of the

remaining candidate set is small after the heaviest first strategy on the

shortest lists terminates, the algorithm can simply retrieve the strings

and compute the actual similarity.

Notice that in case of uniform token weights, it is meaningless to

order lists according to weights. In this special case lists are simply

sorted in increasing order of their lengths and all the algorithms dis-

cussed so far will work without any modifications, with similar ben-

efits. This results in a shortest-first prioritization of lists, which is of

independent interest, as was already mentioned above. In addition, all

L p -norm filters are still applicable: L 1 -norm reduces to the L 0 -norm

and L 2 -norm reduces to the square root of the L 0 -norm.

Surprisingly, the heaviest first strategy can also help prune

candidates for Weighted Intersection similarity, based on the obser-

vation that strings containing the heaviest query tokens are more likely

to exceed the threshold. Let query string v = λ 1 ···

λ v m and threshold

0 <θ≤ i =1 W ( λ i ), and assume without loss of generality that lists

are sorted in decreasing order of token weights (i.e., W ( λ 1 )

≥

W ( λ v m )). Assume that there exists a string s that is contained only

in a su x L ( λ k ) ,...,L ( λ v m ) of token lists, whose aggregate weight

≥

...

i = k W ( λ i ) <θ . Clearly, such a string cannot exceed the query thresh-

old. An immediate conclusion is that

Lemma 6.16. Let query v = λ 1 ···λ v m s.t. W ( λ 1 )

≥ ··· ≥ W ( λ v m ), and

≤ i =1 W ( λ i ). Let

threshold 0 <θ

|v|

W ( λ i ) ≥ θ.

π = arg max

1

≤

π

≤

m

i = π

Approximate String Processing

Search WWH ::

Custom Search

Home