Database Reference
In-Depth Information
only one particular algorithm for one-pass matrix multiplication, we shall see that it is part
of a spectrum of algorithms, and in fact represents an extreme point, where q is as small as
can be, and r is at its maximum. More generally, there is a tradeoff between r and q that
can be expressed as qr ≥ 2 n 2 .
2.6.2
An Example: Similarity Joins
To see the tradeoff between r and q in a realistic situation, we shall examine a problem
known as similarity join . In this problem, we are given a large set of elements X and a sim-
ilarity measure s ( x, y ) that tells how similar two elements x and y of set X are. In Chapter 3
we shall learn about the most important notions of similarity and also learn some tricks that
let us find similar pairs quickly. But here, we shall consider only the raw form of the prob-
lem, where we have to look at each pair of elements of X and determine their similarity by
applying the function s . We assume that s is symmetric, so s ( x, y ) = s ( y, x ), but we assume
nothing else about s . The output of the algorithm is those pairs whose similarity exceeds a
given threshold t .
For example, let us suppose we have a collection of one million images, each of size one
megabyte. Thus, the dataset has size one terabyte. We shall not try to describe the similarity
function s , but it might, say, involve giving higher values when images have roughly the
same distribution of colors or when images have corresponding regions with the same dis-
tribution of colors. The goal would be to discover pairs of images that show the same type
of object or scene. This problem is extremely hard, but classifying by color distribution is
generally of some help toward that goal.
Let us look at how we might do the computation using MapReduce to exploit the natural
parallelism found in this problem. The input is key-value pairs ( i, P i ), where i is an ID for
the picture and P i is the picture itself. We want to compare each pair of pictures, so let us
use one key for each set of two IDs { i, j }. There are approximately 5×10 11 pairs of two
IDs. We want each key { i, j } to be associated with the two values P i and P j , so the input
to the corresponding reducer will be ({ i, j } , [ P i , P j ]). Then, the Reduce function can simply
apply the similarity function s to the two pictures on its value list; that is, compute s ( P i , P j ),
and decide whether the similarity of the two pictures is above threshold. The pair would be
output if so.
Alas, this algorithm will fail completely. The reducer size is small, since no list has more
than two values, or a total of 2MB of input. Although we don't know exactly how the sim-
ilarity function s operates, we can reasonably expect that it will not require more than the
available main memory. However, the replication rate is 999,999, since for each picture we
generate that number of key-value pairs, one for each of the other pictures in the dataset.
Search WWH ::




Custom Search