Database Reference
In-Depth Information
6.4 All-Match Self-join Queries
An all-match self-join query for a dataset S returns all pairs s, s in
the cross-product S
×
S with similarity larger than a user specified
threshold θ .
In case of self-join queries, if the dataset is not already indexed, an
alternative approach is based on indexing the dataset incrementally.
The procedure is shown in Algorithm 6.4.1.
The algorithm works for all similarity measures discussed in
Section 2.2. In addition, the multiway merge algorithm can be replaced
with any other selection algorithm (like nra , optimized multiway
merge , and heaviest first ). Notice that for Algorithm 6.4.1 to work
we have to process the strings in a well defined order. For the multiway
merge algorithm the strings have to be processed in increasing or decreas-
ing order of string identifiers. For nra , optimized multiway merge ,
and heaviest first , they have to be processed in increasing order of
L 1 -norms for Jaccard and Dice, and L 2 -norm for Cosine similarity.
Certain optimizations can be applied when the nra , optimized
multiway merge ,or heaviest first algorithms are used. Concentrat-
ing on Jaccard, assume that strings are processed in increasing order of
their L 1 -norm. As a new string s is processed, we know that all subse-
quent strings will have L 1 -norm larger than or equal to the norm of s .
Hence, by using Lemma 6.11, we can safely remove from the head of all
inverted lists all entries corresponding to strings with L 1 -norm smaller
than θ
1 . Clearly, none of the strings corresponding to these entries
will be candidates for any subsequent string. This helps save space by
decreasing the size of the inverted lists. Similar observations hold for
Dice and Cosine similarity.
s
Algorithm 6.4.1: Self-join( S,θ )
I is an empty inverted index of string tokens
for each s
S
I. Multiway Merge ( s, θ )
I
s
 
Search WWH ::




Custom Search