MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

Here, we shall look only at the join R ( A, B ) ⋈ S ( B, C ) ⋈ T ( C, D ) as an example. Suppose

that the relations R , S , and T have sizes r , s , and t , respectively, and for simplicity, suppose

that p is the probability that:

(1) an R -tuple and and S -tuple agree on B ; and also the probability that

(2) an S -tuple and a T -tuple agree on C .

If we join R and S first, using the MapReduce algorithm of Section 2.3.7 , then the com-

munication cost is O ( r + s ), and the size of the intermediate join R ⋈ S is prs . When we join

this result with T , the communication of this second MapReduce job is O ( t + prs ). Thus,

the entire communication cost of the algorithm consisting of two 2-way joins is O ( r + s +

t + prs ). If we instead join S and T first, and then join R with the result, we get another

algorithm whose communication cost is O ( r + s + t + pst ).

A third way to take this join is to use a single MapReduce job that joins the three re-

lations at once. Suppose that we plan to use k reducers for this job. Pick numbers b and

c representing the number of buckets into which we shall hash B - and C -values, respect-

ively. Let h be a hash function that sends B -values into b buckets, and let g be another hash

function that sends C -values into c buckets. We require that bc = k ; that is, each reducer

corresponds to a pair of buckets, one for the B -value and one for the C -value. The reducer

corresponding to bucket pair ( i, j ) is responsible for joining the tuples R ( u, v ), S ( v, w ), and

T ( w, x ) whenever h ( v ) = i and g ( w ) = j .

As a result, the Map tasks that send tuples of R , S , and T to the reducers that need them

must send R - and T -tuples to more than one reducer. For an S -tuple S ( v, w ), we know the

B - and C -values, so we can send this tuple only to the reducer for ( h ( v ) , g ( w )). However,

consider an R -tuple R ( u, v ). We know it only needs to go to reducers that correspond to

( h ( v ) , y ), for some y . But we don't know y ; the value of C could be anything as far as we

know. Thus, we must send R ( u, v ) to c reducers, since y could be any of the c buckets for

C -values. Similarly, we must send the T -tuple T ( w, x ) to each of the reducers ( z, g ( w )) for

any z . There are b such reducers.

EXAMPLE 2.9 Suppose that b = c = 4, so k = 16. The sixteen reducers can be thought of as

arranged in a rectangle, as suggested by Fig. 2.8 . There, we see a hypothetical S -tuple S ( v,

w ) for which h ( v ) = 2 and g ( w ) = 1. This tuple is sent by its Map task only to the reducer for

key (2 , 1). We also see an R -tuple R ( u, v ). Since h ( v ) = 2, this tuple is sent to all reducers

(2 , y ), for y = 1 , 2 , 3 , 4. Finally, we see a T -tuple T ( w, x ). Since g ( w ) = 1, this tuple is sent

to all reducers ( z, 1) for z = 1 , 2 , 3 , 4. Notice that these three tuples join, and they meet at

exactly one reducer, the reducer for key (2 , 1).

□

Search WWH ::

Custom Search

Home