Database Reference
In-Depth Information
(2) The probability that the signatures do not agree in at least one row of a particular band
is 1 − s r .
(3) The probability that the signatures do not agree in all rows of any of the bands is (1 −
s r ) b .
(4) The probability that the signatures agree in all the rows of at least one band, and there-
fore become a candidate pair, is 1 − (1 − s r ) b .
It may not be obvious, but regardless of the chosen constants b and r , this function has
the form of an S-curve , as suggested in Fig. 3.7 . The threshold , that is, the value of simil-
arity s at which the probability of becoming a candidate is 1/2, is a function of b and r . The
threshold is roughly where the rise is the steepest, and for large b and r there we find that
pairs with similarity above the threshold are very likely to become candidates, while those
below the threshold are unlikely to become candidates - exactly the situation we want.
An approximation to the threshold is (1/ b ) 1 /r . For example, if b = 16 and r = 4, then the
threshold is approximately at s = 1/2, since the 4th root of 1/16 is 1/2.
Figure 3.7 The S-curve
EXAMPLE 3.11 Let us consider the case b = 20 and r = 5. That is, we suppose we have
signatures of length 100, divided into twenty bands of five rows each. Figure 3.8 tabulates
some of the values of the function 1 − (1 − s 5 ) 20 . Notice that the threshold, the value of s at
which the curve has risen halfway, is just slightly more than 0.5. Also notice that the curve
is not exactly the ideal step function that jumps from 0 to 1 at the threshold, but the slope
of the curve in the middle is significant. For example, it rises by more than 0.6 going from
s = 0 . 4 to s = 0 . 6, so the slope in the middle is greater than 3.
Figure 3.8 Values of the S-curve for b = 20 and r = 5
Search WWH ::




Custom Search