Database Reference
In-Depth Information
(2) The probability that the signatures do not agree in at least one row of a particular band
is 1 −
s
r
.
(3) The probability that the signatures do not agree in all rows of any of the bands is (1 −
s
r
)
b
.
(4) The probability that the signatures agree in all the rows of at least one band, and there-
fore become a candidate pair, is 1 − (1 −
s
r
)
b
.
It may not be obvious, but regardless of the chosen constants
b
and
r
, this function has
the form of an
S-curve
, as suggested in
Fig. 3.7
. The
threshold
, that is, the value of simil-
arity
s
at which the probability of becoming a candidate is 1/2, is a function of
b
and
r
. The
threshold is roughly where the rise is the steepest, and for large
b
and
r
there we find that
pairs with similarity above the threshold are very likely to become candidates, while those
below the threshold are unlikely to become candidates - exactly the situation we want.
An approximation to the threshold is (1/
b
)
1
/r
. For example, if
b
= 16 and
r
= 4, then the
threshold is approximately at
s
= 1/2, since the 4th root of 1/16 is 1/2.
Figure 3.7
The S-curve
EXAMPLE
3.11 Let us consider the case
b
= 20 and
r
= 5. That is, we suppose we have
signatures of length 100, divided into twenty bands of five rows each.
Figure 3.8
tabulates
some of the values of the function 1 − (1 −
s
5
)
20
. Notice that the threshold, the value of
s
at
which the curve has risen halfway, is just slightly more than 0.5. Also notice that the curve
is not exactly the ideal step function that jumps from 0 to 1 at the threshold, but the slope
of the curve in the middle is significant. For example, it rises by more than 0.6 going from
s
= 0
.
4 to
s
= 0
.
6, so the slope in the middle is greater than 3.
Figure 3.8
Values of the S-curve for
b
= 20 and
r
= 5