Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

6.4 All-Match Self-join Queries

An all-match self-join query for a dataset S returns all pairs s, s in

the cross-product S

×

S with similarity larger than a user specified

threshold θ .

In case of self-join queries, if the dataset is not already indexed, an

alternative approach is based on indexing the dataset incrementally.

The procedure is shown in Algorithm 6.4.1.

The algorithm works for all similarity measures discussed in

Section 2.2. In addition, the multiway merge algorithm can be replaced

with any other selection algorithm (like nra , optimized multiway

merge , and heaviest first ). Notice that for Algorithm 6.4.1 to work

we have to process the strings in a well defined order. For the multiway

merge algorithm the strings have to be processed in increasing or decreas-

ing order of string identifiers. For nra , optimized multiway merge ,

and heaviest first , they have to be processed in increasing order of

L 1 -norms for Jaccard and Dice, and L 2 -norm for Cosine similarity.

Certain optimizations can be applied when the nra , optimized

multiway merge ,or heaviest first algorithms are used. Concentrat-

ing on Jaccard, assume that strings are processed in increasing order of

their L 1 -norm. As a new string s is processed, we know that all subse-

quent strings will have L 1 -norm larger than or equal to the norm of s .

Hence, by using Lemma 6.11, we can safely remove from the head of all

inverted lists all entries corresponding to strings with L 1 -norm smaller

than θ

1 . Clearly, none of the strings corresponding to these entries

will be candidates for any subsequent string. This helps save space by

decreasing the size of the inverted lists. Similar observations hold for

Dice and Cosine similarity.

s

Algorithm 6.4.1: Self-join( S,θ )

I is an empty inverted index of string tokens

for each s

∈

S

I. Multiway Merge ( s, θ )

I

←

s

Search WWH ::

Custom Search

Home