Database Reference
In-Depth Information
are also at least j − 1 symbols of t that are not in s . The ratio of the sizes of the intersection
and union must be at least J , so we must have:
(
)
If we isolate j in this inequality, we have j ≤ ( L s (1 − J ) − i + 1 + J )/ J .
Figure 3.14 Strings s and t begin with i − 1 and j − 1 unique symbols, respectively, and then agree beyond that
EXAMPLE 3.27 Consider the string s = acdefghijk with J = 0 . 9 discussed at the begin-
ning of this section. Suppose s is now a probe string. We already established that we need
to consider the first two positions; that is, i can be 1 or 2. Suppose i = 1. Then j ≤ (10 ×
0 . 1 − 1 + 1 + 0 . 9)/0 . 9. That is, we only have to compare the symbol a with strings in the
bucket for ( a , j ) if j ≤ 2 . 11. Thus, j can be 1 or 2, but nothing higher.
Now suppose i = 2. Then we require j ≤ (10 × 0 . 1 − 2 + 1 + 0 . 9)/0 . 9, Or j ≤ 1. We con-
clude that we must look in the buckets for ( a , 1), ( a , 2), and ( c , 1), but in no other bucket.
In comparison, using the buckets of Section 3.9.4 , we would look into the buckets for a
and c , which is equivalent to looking to all buckets ( a , j ) and ( c , j ) for any j .
3.9.6
Using Position and Length in Indexes
When we considered the upper limit on j in the previous section, we assumed that what fol-
lows positions i and j were as in Fig. 3.14 , where what followed these positions in strings
s and t matched exactly. We do not want to build an index that involves every symbol in
the strings, because that makes the total work excessive. However, we can add to our index
a summary of what follows the positions being indexed. Doing so expands the number of
buckets, but not beyond reasonable bounds, and yet enables us to eliminate many candidate
matches without comparing entire strings. The idea is to use index buckets corresponding
to a symbol, a position, and the suffix length , that is, the number of symbols following the
position in question.
EXAMPLE 3.28 The string s = acdefghijk , with J = 0 . 9, would be indexed in the buckets
for ( a , 1 , 9) and ( c , 2 , 8). That is, the first position of s has symbol a , and its suffix is of
length 9. The second position has symbol c and its suffix is of length 8.
Search WWH ::




Custom Search