MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

only one particular algorithm for one-pass matrix multiplication, we shall see that it is part

of a spectrum of algorithms, and in fact represents an extreme point, where q is as small as

can be, and r is at its maximum. More generally, there is a tradeoff between r and q that

can be expressed as qr ≥ 2 n 2 .

□

2.6.2

An Example: Similarity Joins

To see the tradeoff between r and q in a realistic situation, we shall examine a problem

known as similarity join . In this problem, we are given a large set of elements X and a sim-

ilarity measure s ( x, y ) that tells how similar two elements x and y of set X are. In Chapter 3

we shall learn about the most important notions of similarity and also learn some tricks that

let us find similar pairs quickly. But here, we shall consider only the raw form of the prob-

lem, where we have to look at each pair of elements of X and determine their similarity by

applying the function s . We assume that s is symmetric, so s ( x, y ) = s ( y, x ), but we assume

nothing else about s . The output of the algorithm is those pairs whose similarity exceeds a

given threshold t .

For example, let us suppose we have a collection of one million images, each of size one

megabyte. Thus, the dataset has size one terabyte. We shall not try to describe the similarity

function s , but it might, say, involve giving higher values when images have roughly the

same distribution of colors or when images have corresponding regions with the same dis-

tribution of colors. The goal would be to discover pairs of images that show the same type

of object or scene. This problem is extremely hard, but classifying by color distribution is

generally of some help toward that goal.

Let us look at how we might do the computation using MapReduce to exploit the natural

parallelism found in this problem. The input is key-value pairs ( i, P i ), where i is an ID for

the picture and P i is the picture itself. We want to compare each pair of pictures, so let us

use one key for each set of two IDs { i, j }. There are approximately 5×10 11 pairs of two

IDs. We want each key { i, j } to be associated with the two values P i and P j , so the input

to the corresponding reducer will be ({ i, j } , [ P i , P j ]). Then, the Reduce function can simply

apply the similarity function s to the two pictures on its value list; that is, compute s ( P i , P j ),

and decide whether the similarity of the two pictures is above threshold. The pair would be

output if so.

Alas, this algorithm will fail completely. The reducer size is small, since no list has more

than two values, or a total of 2MB of input. Although we don't know exactly how the sim-

ilarity function s operates, we can reasonably expect that it will not require more than the

available main memory. However, the replication rate is 999,999, since for each picture we

generate that number of key-value pairs, one for each of the other pictures in the dataset.

Search WWH ::

Custom Search

Home