Database Reference
In-Depth Information
member f of F ′ is constructed from b members of F , say f 1 , f 2 , . . . , f b . We define f ( x ) =
f ( y ) if and only if f i ( x ) = f i ( y ) for one or more values of i . The OR-construction mirrors the
effect of combining several bands: x and y become a candidate pair if any band makes them
a candidate pair.
If p is the probability that a member of F will declare ( x, y ) to be a candidate pair, then 1
p is the probability it will not so declare. (1 − p ) b is the probability that none of f 1 , f 2 , . . .
, f b will declare ( x, y ) a candidate pair, and 1 − (1 − p ) b is the probability that at least one f i
will declare ( x, y ) a candidate pair, and therefore that f will declare ( x, y ) to be a candidate
pair.
Notice that the AND-construction lowers all probabilities, but if we choose F and r judi-
ciously, we can make the small probability p 2 get very close to 0, while the higher probab-
ility p 1 stays significantly away from 0. Similarly, the OR-construction makes all probabil-
ities rise, but by choosing F and b judiciously, we can make the larger probability approach
1 while the smaller probability remains bounded away from 1. We can cascade AND- and
OR-constructions in any order to make the low probability close to 0 and the high prob-
ability close to 1. Of course the more constructions we use, and the higher the values of r
and b that we pick, the larger the number of functions from the original family that we are
forced to use. Thus, the better the final family of functions is, the longer it takes to apply
the functions from this family.
EXAMPLE 3.18 Suppose we start with a family F . We use the AND-construction with r =
4 to produce a family F 1 . We then apply the OR-construction to F 1 with b = 4 to produce
a third family F 2 . Note that the members of F 2 each are built from 16 members of F , and
the situation is analogous to starting with 16 minhash functions and treating them as four
bands of four rows each.
The 4-way AND-function converts any probability p into p 4 . When we follow it by the
4-way OR-construction, that probability is further converted into 1 − (1 − p 4 ) 4 . Some val-
ues of this transformation are indicated in Fig. 3.10 . This function is an S-curve, staying
low for a while, then rising steeply (although not too steeply; the slope never gets much
higher than 2), and then leveling off at high values. Like any S-curve, it has a fixedpoint ,
the value of p that is left unchanged when we apply the function of the S-curve. In this case,
the fixedpoint is the value of p for which p = 1 − (1 − p 4 ) 4 . We can see that the fixedpoint is
somewhere between 0.7 and 0.8. Below that value, probabilities are decreased, and above
it they are increased. Thus, if we pick a high probability above the fixedpoint and a low
probability below it, we shall have the desired effect that the low probability is decreased
and the high probability is increased.
Search WWH ::




Custom Search