Database Reference
In-Depth Information
A, s
∈
identify the answer set
A
, s.t.
|
A
|
=
k
and
∀
s
∈
S
\
A
:Θ(
v,s
)
≥
Θ(
v,s
).
Top-
k
queries are very useful in practice since in many applications it
is dicult to decide in advance a meaningful threshold
θ
for running
an all-match query.
Clearly, all-match queries are easier to evaluate than top-
k
queries,
given that a cutoff similarity threshold for top-
k
queries cannot be
decided in advance, making initial pruning of strings di
cult. Nev-
ertheless, once
k
good answers have been identified (good in a sense
that the
k
-th answer has similarity suciently close to the correct
k
-th
answer) top-
k
queries essentially degenerate to all-match queries. Query
answering strategies typically try to identify
k
good answers as fast as
possible and subsequently revert to all-match query strategies.
4.2
Join Queries
Given two sets of strings and a user specified threshold, all-match join
queries return all pairs of strings in the cross product of the two sets,
with similarity larger than or equal to the threshold.
Definition 4.3 (All-Match Join Query).
Given a string similarity
function Θ, two sets of strings
S,R
, and a positive threshold
θ
, identify
the answer set
A
=
{
(
s, r
)
∈ S × R
:Θ(
s, r
)
≥ θ}.
Top-
k
join queries return the
k
pairs with the largest similarity among
all pairs in the cross product.
Definition 4.4(Top-
Join Query).
Given a string similarity func-
tion Θ, two sets of strings
S,R
, and a positive integer
k
, identify
the answer set
A
s.t.
k
(
s
,r
)
|
A
|
=
k
and
∀
(
s, r
)
∈
A
and
∀
∈
(
S
×
R
)
\
Θ(
s
,r
).
A
:Θ(
s, r
)
≥