Database Reference
In-Depth Information
Algorithm 6.3.1: Sorted Block Nested Loop Join( S,R,θ )
Sort S,R in decreasing L 1 -norm order
I is an empty inverted index of string tokens
s f
first string in S
for each s
S
if memory is available
then I ← s
s l
s
for each r ∈ R s.t. θs f 1 ≤r 1 s l 1
do I. Heaviest First ( r, θ )
θ
else
I
←∅
,I
s
s f
s
Once both datasets are sorted, we simply index the strings in S until
no more memory is available and keep a pointer to the first and last
strings indexed s f ,s l . Then we switch to a probe only phase and process
the strings in R in sorted order, within the appropriate norm intervals
only (e.g., assuming Jaccard similarity, within [ θs f 1 ,s l 1 ]). After
the probe phase is complete, we discard the partial index of S and seek
back to element s l to continue indexing from where we left off. The
drawback of this algorithm is the extra cost of having to sort R . The
algorithm is shown in Algorithm 6.3.1.
More involved partitioning strategies that try to group strings into
clusters of similar strings can also be designed. In general, the par-
titioning strategy used, significantly affects the performance of join
processing. Notice that the hashing based partitioning approaches that
are used for equality joins cannot be applied for similarity joins because
this would typically imply that each record would need to be assigned
to multiple partitions. A good partitioning strategy should minimize
the number of partitions a record is assigned to, without making any
given partition large.
The only previously proposed partitioning algorithm is tailored for
weighted intersection and works in two phases. Let S and R be the
 
Search WWH ::




Custom Search