Databases Reference
In-Depth Information
symbols. Thus, s is indexed under a and c , while t is indexed under b and c .
Whichever is added last will find the other in the bucket for c , and they will be
compared. However, since c is the second symbol of both, we know there will
be two symbols, a and b in this case, that are in the union of the two sets but
not in the intersection. Indeed, even though s and t are identical from c to the
end, their intersection is 9 symbols and their union is 11; thus SIM (s, t) = 9/11,
which is less than 0.9.
If we build our index based not only on the symbol, but on the position of
the symbol within the string, we could avoid comparing s and t above. That
is, let our index have a bucket for each pair (x, i), containing the strings that
have symbol x in position i of their prefix. Given a string s, and assuming J is
the minimum desired Jaccard distance, we look at the prefix of s, that is, the
positions 1 through⌊(1−J )L s
⌋+ 1. If the symbol in position i of the prefix is
x, add s to the index bucket for (x, i).
Now consider s as a probe string. With what buckets must it be compared?
We shall visit the symbols of the prefix of s from the left, and we shall take
advantage of the fact that we only need to find a possible matching string t if
none of the previous buckets we have examined for matches held t. That is, we
only need to find a candidate match once. Thus, if we find that the ith symbol
of s is x, then we need look in the bucket (x, j) for certain small values of j.
Symbols definitely
appearing in
only one string
i
s
t
j
Figure 3.14: Strings s and t begin with i−1 and j−1 unique symbols, respec-
tively, and then agree beyond that
To compute the upper bound on j, suppose t is a string none of whose first
j−1 symbols matched anything in s, but the ith symbol of s is the same as the
jth symbol of t. The highest value of SIM (s, t) occurs if s and t are identical
beyond their ith and jth symbols, respectively, as suggested by Fig. 3.14. If
that is the case, the size of their intersection is L s −i + 1, since that is the
number of symbols of s that could possibly be in t. The size of their union is
at least L s + j−1. That is, s surely contributes L s symbols to the union, and
there are also at least j−1 symbols of t that are not in s. The ratio of the sizes
of the intersection and union must be at least J , so we must have:
−i + 1
L s + j−1
L s
≥J
Search WWH ::




Custom Search