Databases Reference
In-Depth Information
whenever p≥q.
Case 2: p < q. Here, the maximum size of the intersection is L
s
−i + 1, as
when su
x length was not considered. However, the minimum size of the union
is now L
s
+ j−1 + q−p. If we again use the relationship L
s
= i + p, we can
replace L
s
−p by i and get the formula i + j−1 + q for the size of the union.
If the Jaccard similarity is at least J , then
L
s
−i + 1
i + j−1 + q
≥J
whenever p < q.
Example 3.29 : Let us again consider the string s =
acdefghijk
, but to make
the example show some details, let us choose J = 0.8 instead of 0.9. We know
that L
s
= 10.
⌋+ 1 = 3, we must consider prefix positions
i = 1, 2, and 3 in what follows. As before, let p be the su
x length of s and q
the su
x length of t.
First, consider the case p≥q. The additional constraint we have on q and
j is (q + 1)/(9 + j)≥0.8. We can enumerate the pairs of values of j and q for
each i between 1 and 3, as follows.
Since⌊(1−J )L
s
i = 1: Here, p = 9, so q≤9. Let us consider the possible values of q:
q = 9: We must have 10/(9 + j)≥0.8. Thus, we can have j = 1, j = 2,
or j = 3. Note that for j = 4, 10/13 > 0.8.
q = 8: We must have 9/(9 + j)≥0.8. Thus, we can have j = 1 or j = 2.
For j = 3, 9/12 > 0.8.
q = 7: We must have 8/(9 + j)≥0.8. Only j = 1 satisfies this inequality.
q = 6: There are no possible values of j, since 7/(9 + j) > 0.8 for every
positive integer j. The same holds for every smaller value of q.
i = 2: Here, p = 8, so we require q≤8. Since the constraint (q + 1)/(9 + j≥0.8
does not depend on i,
6
we can use the analysis from the above case, but
exclude the case q = 9. Thus, the only possible values of j and q when
i = 2 are
1. q = 8; j = 1.
2. q = 8; j = 2.
3. q = 7; j = 1.
i = 3: Now, p = 7 and the constraints are q≤7 and (q + 1)/(9 + j)≥0.8. The
only option is q = 7 and j = 1.
6
Note that i does influence the value of p, and through p, puts a limit on q.