Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

Algorithm 6.3.1: Sorted Block Nested Loop Join( S,R,θ )

Sort S,R in decreasing L 1 -norm order

I is an empty inverted index of string tokens

s f ←

first string in S

for each s

∈

S

if memory is available

then I ← s

s l

←

s

for each r ∈ R s.t. θs f 1 ≤r 1 ≤ s l 1

do I. Heaviest First ( r, θ )

θ

else

I

←∅

,I

←

s

s f

←

s

Once both datasets are sorted, we simply index the strings in S until

no more memory is available and keep a pointer to the first and last

strings indexed s f ,s l . Then we switch to a probe only phase and process

the strings in R in sorted order, within the appropriate norm intervals

only (e.g., assuming Jaccard similarity, within [ θs f 1 ,s l 1 /θ ]). After

the probe phase is complete, we discard the partial index of S and seek

back to element s l to continue indexing from where we left off. The

drawback of this algorithm is the extra cost of having to sort R . The

algorithm is shown in Algorithm 6.3.1.

More involved partitioning strategies that try to group strings into

clusters of similar strings can also be designed. In general, the par-

titioning strategy used, significantly affects the performance of join

processing. Notice that the hashing based partitioning approaches that

are used for equality joins cannot be applied for similarity joins because

this would typically imply that each record would need to be assigned

to multiple partitions. A good partitioning strategy should minimize

the number of partitions a record is assigned to, without making any

given partition large.

The only previously proposed partitioning algorithm is tailored for

weighted intersection and works in two phases. Let S and R be the

Approximate String Processing

Search WWH ::

Custom Search

Home