$m_{ij}$ we generate $g$ key-value pairs with value equal to $m_{ij}$, together with its row and column numbers, $i$ and $j$, to identify the matrix element. There is one key-value pair for each key $(I, J, K)$, where $K$ can be any of the $g$ groups of columns of $N$. Similarly, from element $n_{jk}$ of $N$, if $j$ belongs to group $J$ and $k$ to group $K$, the Map function generates $g$ key-value pairs with value consisting of $n_{jk}$, $j$, and $k$, and with keys $(I, J, K)$ for any group $I$.
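The grouping and replication described above can be sketched in a few lines of Python. This is a minimal illustration, not the text's implementation: the helper names `grp` and `map_first_pass` are invented here, and we assume $g$ evenly divides $n$.

```python
# Sketch of the first-pass Map function. Assumes g evenly divides n;
# the names grp and map_first_pass are illustrative, not from the text.

def grp(i, n, g):
    """Group number (0..g-1) that row/column index i belongs to."""
    return i // (n // g)

def map_first_pass(tag, row, col, value, n, g):
    """Emit the g key-value pairs for one matrix element.

    tag is 'M' for element m_ij (row=i, col=j) or 'N' for element
    n_jk (row=j, col=k).  The value carries the element together with
    its row and column numbers, as the text describes.
    """
    if tag == 'M':
        I, J = grp(row, n, g), grp(col, n, g)
        # m_ij is sent to keys (I, J, K) for every group K
        return [((I, J, K), ('M', row, col, value)) for K in range(g)]
    else:
        J, K = grp(row, n, g), grp(col, n, g)
        # n_jk is sent to keys (I, J, K) for every group I
        return [((I, J, K), ('N', row, col, value)) for I in range(g)]
```

Each element is thus replicated $g$ times, once per reducer that will need it.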
The Reduce Function The reducer corresponding to $(I, J, K)$ receives as input all the elements $m_{ij}$ where $i$ is in $I$ and $j$ is in $J$, and it also receives all the elements $n_{jk}$, where $j$ is in $J$ and $k$ is in $K$. It computes
$$x_{iJk} = \sum_{j \in J} m_{ij} n_{jk}$$
for all $i$ in $I$ and $k$ in $K$.
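The reducer's partial sums can be sketched as follows. This is an illustrative sketch, assuming the Map emits tagged tuples `('M', i, j, m_ij)` and `('N', j, k, n_jk)`; the function name is invented here.

```python
from collections import defaultdict

def reduce_first_pass(key, values):
    """Reducer for key (I, J, K): computes x_iJk = sum over j in J of
    m_ij * n_jk, for every i in I and k in K.

    values holds the tagged tuples emitted by the Map:
    ('M', i, j, m_ij) and ('N', j, k, n_jk).
    """
    I, J, K = key
    m_elems = [(i, j, v) for tag, i, j, v in values if tag == 'M']
    n_elems = [(j, k, v) for tag, j, k, v in values if tag == 'N']
    x = defaultdict(float)
    for i, j, mv in m_elems:
        for j2, k, nv in n_elems:
            if j == j2:                  # the sum runs over shared j in J
                x[(i, J, k)] += mv * nv
    # Each entry is one x_iJk, labeled with i and k for the second pass.
    return dict(x)
```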
Notice that the replication rate for the first MapReduce job is $g$, and the total communication is therefore $2gn^2$. Also notice that each reducer gets $2n^2/g^2$ inputs, so $q = 2n^2/g^2$. Equivalently, $g = n\sqrt{2/q}$. Thus, the total communication $2gn^2$ can be written in terms of $q$ as $2\sqrt{2}\,n^3/\sqrt{q}$.
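As a quick numeric sanity check on this algebra (the particular values of $n$ and $g$ below are arbitrary), substituting $q = 2n^2/g^2$ back into $2gn^2$ does recover the $2\sqrt{2}\,n^3/\sqrt{q}$ form:

```python
import math

n, g = 1000, 10
q = 2 * n**2 / g**2                       # inputs per reducer
comm_direct = 2 * g * n**2                # replication rate g, 2n^2 inputs
comm_in_terms_of_q = 2 * math.sqrt(2) * n**3 / math.sqrt(q)
assert math.isclose(comm_direct, comm_in_terms_of_q)
```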
The second MapReduce job is simple; it sums up the $x_{iJk}$ over all sets $J$.
The Map Function We assume that the Map tasks execute at whatever compute nodes executed the Reduce tasks of the previous job. Thus, no communication is needed between the jobs. The Map function takes as input one element $x_{iJk}$, which we assume the previous reducers have left labeled with $i$ and $k$ so we know to what element of matrix $P$ this term contributes. One key-value pair is generated. The key is $(i, k)$ and the value is $x_{iJk}$.
The Reduce Function The Reduce function simply sums the values associated with key $(i, k)$ to compute the output element $P_{ik}$.
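Both functions of the second job fit in a single short sketch, assuming each partial sum arrives as a pair `((i, J, k), x_iJk)` (the function name is invented for illustration):

```python
from collections import defaultdict

def second_pass(partial_sums):
    """Map + Reduce of the second job, combined for illustration.

    partial_sums: iterable of ((i, J, k), x_iJk) pairs.  The Map emits
    one key-value pair per input with key (i, k); the Reduce sums the
    values for each key to produce the output element P_ik.
    """
    P = defaultdict(float)
    for (i, J, k), x in partial_sums:
        P[(i, k)] += x            # summing x_iJk over all sets J
    return dict(P)
```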
The communication between the Map and Reduce tasks of the second job is $gn^2$, since there are $n$ possible values of $i$, $n$ possible values of $k$, and $g$ possible values of the set $J$, and each $x_{iJk}$ is communicated only once. If we recall from our analysis of the first MapReduce job that $g = n\sqrt{2/q}$, we can write the communication for the second job as $\sqrt{2}\,n^3/\sqrt{q}$. This amount is exactly half the communication for the first job, so the total communication for the two-pass algorithm is $3\sqrt{2}\,n^3/\sqrt{q}$. Although we shall not examine this point here, it turns out that we can do slightly better if we divide the matrices $M$ and $N$ not into squares but into rectangles that are twice as long on one side as on the other. In that case, we get the slightly smaller constant 4 in place of $3\sqrt{2} \approx 4.24$, and we get a two-pass algorithm with communication equal to $4n^3/\sqrt{q}$.
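The claim that the second job costs exactly half the first, and that the two-pass total is $3\sqrt{2}\,n^3/\sqrt{q}$, can also be checked numerically (arbitrary $n$ and $g$):

```python
import math

n, g = 1000, 10
q = 2 * n**2 / g**2
comm_first = 2 * g * n**2         # first job: 2gn^2
comm_second = g * n**2            # second job: gn^2
assert 2 * comm_second == comm_first
total = 3 * math.sqrt(2) * n**3 / math.sqrt(q)
assert math.isclose(comm_first + comm_second, total)
```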
Now, recall that the communication cost we computed for the one-pass algorithm is $4n^4/q$. We may as well assume $q$ is less than $n^2$, or else we can just use a serial algorithm at one com-