Database Reference
In-Depth Information
6.4 All-Match Self-join Queries
An all-match self-join query for a dataset
S
returns all pairs
s, s
in
the cross-product
S
×
S
with similarity larger than a user specified
threshold
θ
.
In case of self-join queries, if the dataset is not already indexed, an
alternative approach is based on indexing the dataset incrementally.
The procedure is shown in Algorithm 6.4.1.
The algorithm works for all similarity measures discussed in
Section 2.2. In addition, the
multiway merge
algorithm can be replaced
with any other selection algorithm (like
nra
,
optimized multiway
merge
, and
heaviest first
). Notice that for Algorithm 6.4.1 to work
we have to process the strings in a well defined order. For the
multiway
merge
algorithm the strings have to be processed in increasing or decreas-
ing order of string identifiers. For
nra
,
optimized multiway merge
,
and
heaviest first
, they have to be processed in increasing order of
L
1
-norms for Jaccard and Dice, and
L
2
-norm for Cosine similarity.
Certain optimizations can be applied when the
nra
,
optimized
multiway merge
,or
heaviest first
algorithms are used. Concentrat-
ing on Jaccard, assume that strings are processed in increasing order of
their
L
1
-norm. As a new string
s
is processed, we know that all subse-
quent strings will have
L
1
-norm larger than or equal to the norm of
s
.
Hence, by using Lemma 6.11, we can safely remove from the head of all
inverted lists all entries corresponding to strings with
L
1
-norm smaller
than
θ
1
. Clearly, none of the strings corresponding to these entries
will be candidates for any subsequent string. This helps save space by
decreasing the size of the inverted lists. Similar observations hold for
Dice and Cosine similarity.
s
Algorithm 6.4.1:
Self-join(
S,θ
)
I
is an empty inverted index of string tokens
for each
s
∈
S
I.
Multiway Merge
(
s, θ
)
I
←
s