Database Reference
In-Depth Information
the probe string. Determine the length of the prefix that must be considered, that is, (1 −
J ) L s + 1. For each symbol appearing in one of the prefix positions of s , we look in the in-
dex bucket for that symbol, and we compare s with all the strings appearing in that bucket.
If we want to solve the many-many problem, start with an empty database of strings and
indexes. For each set S , we treat S as a new set for the many-one problem. We convert S
to a string s , which we treat as a probe string in the many-one problem. However, after we
examine an index bucket, we also add s to that bucket, so s will be compared with later
strings that could be matches.
3.9.5
Using Position Information
Consider the strings s = acdefghijk and t = bcdefghijk , and assume J = 0 . 9. Since
both strings are of length 10, they are indexed under their first two symbols. Thus, s is in-
dexed under a and c , while t is indexed under b and c . Whichever is added last will find
the other in the bucket for c , and they will be compared. However, since c is the second
symbol of both, we know there will be two symbols, a and b in this case, that are in the
union of the two sets but not in the intersection. Indeed, even though s and t are identical
from c to the end, their intersection is 9 symbols and their union is 11; thus SIM( s, t ) =
9/11, which is less than 0.9.
If we build our index based not only on the symbol, but on the position of the symbol
within the string, we could avoid comparing s and t above. That is, let our index have a
bucket for each pair ( x, i ), containing the strings that have symbol x in position i of their
prefix. Given a string s , and assuming J is the minimum desired Jaccard distance, we look
at the prefix of s , that is, the positions 1 through (1 − J ) L s + 1. If the symbol in position i
of the prefix is x , add s to the index bucket for ( x, i ).
Now consider s as a probe string. With what buckets must it be compared? We shall visit
the symbols of the prefix of s from the left, and we shall take advantage of the fact that
we only need to find a possible matching string t if none of the previous buckets we have
examined for matches held t . That is, we only need to find a candidate match once. Thus, if
we find that the i th symbol of s is x , then we need look in the bucket ( x, j ) for certain small
values of j .
To compute the upper bound on j , suppose t is a string none of whose first j − 1 symbols
matched anything in s , but the i th symbol of s is the same as the j th symbol of t . The highest
value of SIM( s, t ) occurs if s and t are identical beyond their i th and j th symbols, respect-
ively, as suggested by Fig. 3.14 . If that is the case, the size of their intersection is L s i
+ 1, since that is the number of symbols of s that could possibly be in t . The size of their
union is at least L s + j − 1. That is, s surely contributes L s symbols to the union, and there
Search WWH ::




Custom Search