Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

2. An unseen string s appears only in L ( λ 1 ) ,L ( λ 2 ). By con-

struction

s

≥

f 2

⇒N

( s, v ) <θ .

1

3. ...

m . An unseen string s appears only in L ( λ 1 ) ,...,L ( λ v m ). By con-

struction

s

1 >

f m

⇒N

( s, v ) <θ .

1

We can also use Observation 6.1 to improve the pruning power of

the nra algorithm by computing a best-case similarity for all candidate

strings before and after they have been inserted in the candidate set.

Strings whose best-case similarity is below the threshold can be pruned

immediately, reducing the memory requirements and the topic-keeping

cost of the algorithm. The best-case similarity of a new candidate uses

Observation 6.1 to identify lists that do not contain that candidate.

Lemma 6.4. Given query v , candidate string s , and a subset of lists

L v ⊆

L v potentially containing s (based on Observation 6.1 and the

frontier elements f i ), the best-case similarity score for s is

v 1

b ( s, v )=

N

1 ) .

max(

s

1 ,

v

It is important to note here that the above reasoning is correct if

and only if each string entry appears only once in every token list.

This is indeed the case for the frequency set and set representations,

but not always true for sequences, when the positional information is

disregarded (e.g., for computing the similarity of strings without regard

for positions). In that case, in order to compute the termination con-

dition correctly, we need to know for an arbitrary frontier element the

maximum number of times that the element might be contained in a

given list. The maximum frequency of an arbitrary entry per inverted

list can be computed at list construction time. Alternatively, we can

assume that a given token appears in a data string at most as many

times as it appears in the query, as a worst case scenario, and as a

result overestimate the similarity of the frontier elements.

An alternative implementation of the nra algorithm can postpone

updating candidate bit vectors and purging candidates from M until

Approximate String Processing

Search WWH ::

Custom Search

Home