Database Reference
In-Depth Information
3.4.4
Exercises for Section 3.4
EXERCISE 3.4.1 Evaluate the S-curve 1 − (1 − s r ) b for s = 0 . 1 , 0 . 2 , . . . , 0 . 9, for the following
values of r and b :
r = 3 and b = 10.
r = 6 and b = 20.
r = 5 and b = 50.
! EXERCISE 3.4.2 For each of the ( r, b ) pairs in Exercise 3.4.1 , compute the threshold, that
is, the value of s for which the value of 1−(1− s r ) b is exactly 1/2. How does this value com-
pare with the estimate of (1/ b ) 1 /r that was suggested in Section 3.4.2 ?
! EXERCISE 3.4.3 Use the techniques explained in Section 1.3.5 to approximate the S-curve
1 − (1 − s r ) b when s r is very small.
! EXERCISE 3.4.4 Suppose we wish to implement LSH by MapReduce. Specifically, as-
sume chunks of the signature matrix consist of columns, and elements are key-value pairs
where the key is the column number and the value is the signature itself (i.e., a vector of
values).
(a) Show how to produce the buckets for all the bands as output of a single MapReduce
process. Hint : Remember that a Map function can produce several key-value pairs
from a single element.
(b) Show how another MapReduce process can convert the output of (a) to a list of pairs
that need to be compared. Specifically, for each column i , there should be a list of those
columns j > i with which i needs to be compared.
3.5 Distance Measures
We now take a short detour to study the general notion of distance measures. The Jaccard
similarity is a measure of how close sets are, although it is not really a distance measure.
That is, the closer sets are, the higher the Jaccard similarity. Rather, 1 minus the Jaccard
similarity is a distance measure, as we shall see; it is called the Jaccard distance .
However, Jaccard distance is not the only measure of closeness that makes sense. We
shall examine in this section some other distance measures that have applications. Then,
in Section 3.6 we see how some of these distance measures also have an LSH technique
Search WWH ::




Custom Search