iMapReduce - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

column k , then the product P = MN is the matrix P with element p ik in row i and

column k , where p

= ∑ .

In the MapReduce framework, we use two map-reduce phases to perform the

matrix multiplication,* which forms each iteration. In the first phase, the map

extracts the columns of M and the rows of N , and the reduce joins column j of M and

row j of N together. In the second phase, the map multiplies a column vector with the

joined row vector to obtain a matrix, and the reduce sums these matrices to obtain

the final result matrix. The map and reduce operations in each iteration are described

as follows.

m n

MPI Map 1: For each key ( i , j ), send each matrix element m ij of M to the KV

〈 j , ( M , i , m ij )〉. For each key ( j , k ), send each matrix element n jk of N to the

KV 〈 j , ( N , k , n jk )〉.

MPI Reduce 1: For each key j , collect its list of associated values

(

) (

)

(

)

(

)

Mim

, ,

Mi m



N kn

Nk n

,  .

k 2

MPI Map 2: Take the output KV of Reduce 1. For each value that comes from

M , say ( M , i , m ij ), and each value that comes from N , say ( N , k , n jk ), produce

the KV 〈( i , k ), m ij n jk 〉. It will output all the permutations (, ),

ik mn

21 2 1 , …

MPI Reduce 2: For each key ( i , k ), produce the sum of the list of values associ-

ated with this key. The result KV is 〈( i , k ), p ik 〉, where p

(, ),

ik mn

,…, (, ),

ik mn

ij jk

= ∑ .

m n

3.1.3 P roPerties oF i iterative C omPutations in m aP r eDuCe

Through these examples, we have several observations about writing iterative algo-

rithms in Hadoop MapReduce.

• Iterative Data Flow. MapReduce exploits a batched processing model. By

using MapReduce to implement an iterative algorithm, the iterative loop is

missing. On the other hand, to emulate an iterative data flow, an iterative

algorithm is implemented by a series of MapReduce jobs. Each iteration

corresponds to one or more jobs. The next job's input is the previous job's

output. The map/reduce operations in all iterations are the same.

• Dynamic State Data vs. Static Structure Data. Iterative computation

involves two kinds of data, the state data and the structure data . During the

iterative computation, the state data is iteratively updated (e.g., the PageRank

scores in PageRank, the centroids in K-means, or the matrix multiplier N = M k −1

in MPI), while the structure data is iteration-invariant (e.g., the Web linkage

graph in PageRank, the point coordinates in K-means, or the basic matrix M

in MPI). To process these data in MapReduce, both the dynamic state data and

the static structure data are represented by KVs (i.e., state KVs and structure

* http://infolab.stanford.edu/ullman/mmds/book.pdf.

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home