Database Reference
In-Depth Information
the
probe
string. Determine the length of the prefix that must be considered, that is,
⌊
(1 −
J
)
L
s
⌋
+ 1. For each symbol appearing in one of the prefix positions of
s
, we look in the in-
dex bucket for that symbol, and we compare
s
with all the strings appearing in that bucket.
If we want to solve the many-many problem, start with an empty database of strings and
indexes. For each set
S
, we treat
S
as a new set for the many-one problem. We convert
S
to a string
s
, which we treat as a probe string in the many-one problem. However, after we
examine an index bucket, we also add
s
to that bucket, so
s
will be compared with later
strings that could be matches.
3.9.5
Using Position Information
Consider the strings
s
=
acdefghijk
and
t
=
bcdefghijk
, and assume
J
= 0
.
9. Since
both strings are of length 10, they are indexed under their first two symbols. Thus,
s
is in-
dexed under
a
and
c
, while
t
is indexed under
b
and
c
. Whichever is added last will find
the other in the bucket for
c
, and they will be compared. However, since
c
is the second
symbol of both, we know there will be two symbols,
a
and
b
in this case, that are in the
union of the two sets but not in the intersection. Indeed, even though
s
and
t
are identical
from
c
to the end, their intersection is 9 symbols and their union is 11; thus SIM(
s, t
) =
9/11, which is less than 0.9.
If we build our index based not only on the symbol, but on the position of the symbol
within the string, we could avoid comparing
s
and
t
above. That is, let our index have a
bucket for each pair (
x, i
), containing the strings that have symbol
x
in position
i
of their
prefix. Given a string
s
, and assuming
J
is the minimum desired Jaccard distance, we look
at the prefix of
s
, that is, the positions 1 through
⌊
(1 −
J
)
L
s
⌋
+ 1. If the symbol in position
i
of the prefix is
x
, add
s
to the index bucket for (
x, i
).
Now consider
s
as a probe string. With what buckets must it be compared? We shall visit
the symbols of the prefix of
s
from the left, and we shall take advantage of the fact that
we only need to find a possible matching string
t
if none of the previous buckets we have
examined for matches held
t
. That is, we only need to find a candidate match once. Thus, if
we find that the
i
th symbol of
s
is
x
, then we need look in the bucket (
x, j
) for certain small
values of
j
.
To compute the upper bound on
j
, suppose
t
is a string none of whose first
j
− 1 symbols
matched anything in
s
, but the
i
th symbol of
s
is the same as the
j
th symbol of
t
. The highest
value of SIM(
s, t
) occurs if
s
and
t
are identical beyond their
i
th and
j
th symbols, respect-
+ 1, since that is the number of symbols of
s
that could possibly be in
t
. The size of their
union is at least
L
s
+
j
− 1. That is,
s
surely contributes
L
s
symbols to the union, and there