Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

initialized with one event s, 1 , 1 . 0 per string s ∈ S , and maintained in

decreasing order of

s .

The similarity upper bound for a given string s is computed by

considering the best case scenario. Abusing notation, let P

π ( s ))

denote the prefix (sux) of string s containing exactly the first π

tokens (all remaining tokens). By construction, the tokens in the cur-

rent indexed prefix P

π ( s )( S

π ( r ) of any string r that is not already contained

in the answer set, do not match any of the tokens in the already indexed

prefix

π ( s )of s . Then, the best-case similarity upper bound between

s and r is when all of the tokens in the current (unseen) sux S

π ( r )of

π ( s )of s . The potential Jaccard

r match those in the unknown sux

similarity becomes

π ( s )

1 ≤ S

( s, r )

≤

π ( r )

π ( s )

1 +

π ( s )

Here, we need to compute the upper bound similarity of s with respect

to an arbitrary string r , thus based only on information in s ; drop-

ping the term

π ( r )

1 from the denominator and maximizing the

numerator yields the desired upper bound. Hence

π ( s )

s = S

−P

s 1

An interesting observation here is that if we assume unit token weights

we can improve the similarity upper bound based on the fact that

during each iteration of the algorithm the indexed prefixes consist of

exactly the same number of tokens for all strings. In that case we know

that π = π (i.e.,

π ( r )

π ( s )

1 =

1 ), hence the similarity upper

bound becomes

π ( s )

s =

π ( s ) 1 .

The algorithm proceeds by extracting the top event from the event

heap, say

s 1 + P

, and then scans inverted list L ( λ π ) to identify new

candidates. The exact similarity of each candidate is computed and

the top- k result heap is updated appropriately. Then, the algorithm

needs to index the new token λ π and compute the similarity upper

bound

s, π,

π +1

of s with any unseen string that does not agree with s

Approximate String Processing

Search WWH ::

Custom Search

Home