Databases Reference
In-Depth Information
However, observe that the more similar two columns are, the more likely it is
that they will be identical in some band. Thus, intuitively the banding strategy
makes similar columns much more likely to be candidate pairs than dissimilar
pairs.
2
3.4.2 Analysis of the Banding Technique
Suppose we use b bands of r rows each, and suppose that a particular pair of
documents have Jaccard similarity s. Recall from Section 3.3.3 that the prob-
ability the minhash signatures for these documents agree in any one particular
row of the signature matrix is s. We can calculate the probability that these
documents (or rather their signatures) become a candidate pair as follows:
1. The probability that the signatures agree in all rows of one particular
band is s r .
2. The probability that the signatures do not agree in at least one row of a
particular band is 1−s r .
3. The probability that the signatures do not agree in all rows of any of the
bands is (1−s r ) b .
4. The probability that the signatures agree in all the rows of at least one
band, and therefore become a candidate pair, is 1−(1−s r ) b .
It may not be obvious, but regardless of the chosen constants b and r, this
function has the form of an S-curve, as suggested in Fig. 3.7. The threshold, that
is, the value of similarity s at which the rise becomes steepest, is a function of
b and r. An approximation to the threshold is (1/b) 1/r . For example, if b = 16
and r = 4, then the threshold is approximately 1/2, since the 4th root of 16 is
2.
Example 3.11 : Let us consider the case b = 20 and r = 5. That is, we suppose
we have signatures of length 100, divided into twenty bands of five rows each.
Figure 3.8 tabulates some of the values of the function 1−(1−s 5 ) 20 . Notice
that the threshold, the value of s at which the curve has risen halfway, is just
slightly more than 0.5. Also notice that the curve is not exactly the ideal step
function that jumps from 0 to 1 at the threshold, but the slope of the curve
in the middle is significant. For example, it rises by more than 0.6 going from
s = 0.4 to s = 0.6, so the slope in the middle is greater than 3.
For example, at s = 0.8, 1−(0.8) 5 is about 0.328. If you raise this number
to the 20th power, you get about 0.00035. Subtracting this fraction from 1
yields 0.99965. That is, if we consider two documents with 80% similarity, then
in any one band, they have only about a 33% chance of agreeing in all five rows
and thus becoming a candidate pair. However, there are 20 bands and thus 20
chances to become a candidate. Only roughly one in 3000 pairs that are as high
as 80% similar will fail to become a candidate pair and thus be a false negative.
2
Search WWH ::




Custom Search