Database Reference
In-Depth Information
Here, we shall look only at the join R ( A, B ) S ( B, C ) T ( C, D ) as an example. Suppose
that the relations R , S , and T have sizes r , s , and t , respectively, and for simplicity, suppose
that p is the probability that:
(1) an R -tuple and and S -tuple agree on B ; and also the probability that
(2) an S -tuple and a T -tuple agree on C .
If we join R and S first, using the MapReduce algorithm of Section 2.3.7 , then the com-
munication cost is O ( r + s ), and the size of the intermediate join R S is prs . When we join
this result with T , the communication of this second MapReduce job is O ( t + prs ). Thus,
the entire communication cost of the algorithm consisting of two 2-way joins is O ( r + s +
t + prs ). If we instead join S and T first, and then join R with the result, we get another
algorithm whose communication cost is O ( r + s + t + pst ).
A third way to take this join is to use a single MapReduce job that joins the three re-
lations at once. Suppose that we plan to use k reducers for this job. Pick numbers b and
c representing the number of buckets into which we shall hash B - and C -values, respect-
ively. Let h be a hash function that sends B -values into b buckets, and let g be another hash
function that sends C -values into c buckets. We require that bc = k ; that is, each reducer
corresponds to a pair of buckets, one for the B -value and one for the C -value. The reducer
corresponding to bucket pair ( i, j ) is responsible for joining the tuples R ( u, v ), S ( v, w ), and
T ( w, x ) whenever h ( v ) = i and g ( w ) = j .
As a result, the Map tasks that send tuples of R , S , and T to the reducers that need them
must send R - and T -tuples to more than one reducer. For an S -tuple S ( v, w ), we know the
B - and C -values, so we can send this tuple only to the reducer for ( h ( v ) , g ( w )). However,
consider an R -tuple R ( u, v ). We know it only needs to go to reducers that correspond to
( h ( v ) , y ), for some y . But we don't know y ; the value of C could be anything as far as we
know. Thus, we must send R ( u, v ) to c reducers, since y could be any of the c buckets for
C -values. Similarly, we must send the T -tuple T ( w, x ) to each of the reducers ( z, g ( w )) for
any z . There are b such reducers.
EXAMPLE 2.9 Suppose that b = c = 4, so k = 16. The sixteen reducers can be thought of as
arranged in a rectangle, as suggested by Fig. 2.8 . There, we see a hypothetical S -tuple S ( v,
w ) for which h ( v ) = 2 and g ( w ) = 1. This tuple is sent by its Map task only to the reducer for
key (2 , 1). We also see an R -tuple R ( u, v ). Since h ( v ) = 2, this tuple is sent to all reducers
(2 , y ), for y = 1 , 2 , 3 , 4. Finally, we see a T -tuple T ( w, x ). Since g ( w ) = 1, this tuple is sent
to all reducers ( z, 1) for z = 1 , 2 , 3 , 4. Notice that these three tuples join, and they meet at
exactly one reducer, the reducer for key (2 , 1).
Search WWH ::




Custom Search