Database Reference
In-Depth Information
3.9.4
Prefix Indexing
In addition to length, there are several other features of strings that can be exploited to limit
the number of comparisons that must be made to identify all pairs of similar strings. The
simplest of these options is to create an index for each symbol; recall a symbol of a string
is any one of the elements of the universal set. For each string s , we select a prefix of s
consisting of the first p symbols of s . How large p must be depends on L s and J , the lower
bound on Jaccard distance. We add string s to the index for each of its first p symbols.
In effect, the index for each symbol becomes a bucket of strings that must be compared.
We must be certain that any other string t such that SIM( s, t ) ≥ J will have at least one sym-
bol in its prefix that also appears in the prefix of s .
Suppose not; rather SIM( s, t ) ≥ J , but t has none of the first p symbols of s . Then the
highest Jaccard similarity that s and t can have occurs when t is a suffix of s , consisting of
everything but the first p symbols of s . The Jaccard similarity of s and t would then be ( L s
p )/ L s . To be sure that we do not have to compare s with t , we must be certain that J >
( L s p )/ L s . That is, p must be at least (1 − J ) L s + 1. Of course we want p to be as small
as possible, so we do not index string s in more buckets than we need to. Thus, we shall
hereafter take p = (1 − J ) L s + 1 to be the length of the prefix that gets indexed.
EXAMPLE 3.26 Suppose J = 0 . 9. If L s = 9, then p = 0 . 1×9 +1 = 0 . 9 +1 = 1. That is, we
need to index s under only its first symbol. Any string t that does not have the first symbol
of s in a position such that t is indexed by that symbol will have Jaccard similarity with s
that is less than 0.9. Suppose s is bcdefghij . Then s is indexed under b only. Suppose t
does not begin with b . There are two cases to consider.
(1) If t begins with a and SIM( s, t ) ≥ 0 . 9, then it can only be that t is abcdefghij . But
if that is the case, t will be indexed under both a and b . The reason is that L t = 10, so t
will be indexed under the symbols of its prefix of length 0 . 1 × 10 + 1 = 2.
(2) If t begins with c or a later letter, then the maximum value of SIM( s, t ) occurs when t
is cdefghij . But then SIM( s, t ) = 8/9 < 0 . 9.
In general, with J = 0 . 9, strings of length up to 9 are indexed by their first symbol, strings
of lengths 10-19 are indexed under their first two symbols, strings of length 20-29 are in-
dexed under their first three symbols, and so on.
We can use the indexing scheme in two ways, depending on whether we are trying to
solve the many-many problem or a many-one problem; recall the distinction was intro-
duced in Section 3.8.4 . For the many-one problem, we create the index for the entire data-
base. To query for matches to a new set S , we convert that set to a string s , which we call
Search WWH ::




Custom Search