MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

sire to shrink the wall-clock time and to execute each reducer in main memory. Recall that

a “reducer” is the execution of the Reduce function on a single key and its associated value

list. The point of the exploration in this section is that for many problems there is a spec-

trum of MapReduce algorithms requiring different amounts of communication. Moreover,

the less communication an algorithm uses, the worse it may be in other respects, including

wall-clock time and the amount of main memory it requires.

2.6.1

Reducer Size and Replication Rate

Let us now introduce the two parameters that characterize families of MapReduce al-

gorithms. The first is the reducer size , which we denote by q . This parameter is the upper

bound on the number of values that are allowed to appear in the list associated with a single

key. Reducer size can be selected with at least two goals in mind:

(1) By making the reducer size small, we can force there to be many reducers, i.e., many

different keys according to which the problem input is divided by the Map tasks. If we

also create many Reduce tasks - even one for each reducer - then there will be a high

degree of parallelism, and we can look forward to a low wall-clock time.

(2) We can choose a reducer size sufficiently small that we are certain the computation

associated with a single reducer can be executed entirely in the main memory of the

compute node where its Reduce task is located. Regardless of the computation done

by the reducers, the running time will be greatly reduced if we can avoid having to

move data repeatedly between main memory and disk.

The second parameter is the replication rate , denoted r . We define r to be the number

of key-value pairs produced by all the Map tasks on all the inputs, divided by the number

of inputs. That is, the replication rate is the average communication from Map tasks to Re-

duce tasks (measured by counting key-value pairs) per input.

EXAMPLE 2.11 Let us consider the one-pass matrix-multiplication algorithm of Section

2.3.10 . Suppose that all the matrices involved are n × n matrices. Then the replication rate r

is equal to n . That fact is easy to see, since for each element m ij , there are n key-value pairs

produced; these have all keys of the form ( i, k ), for 1 ≤ k ≤ n . Likewise, for each element of

the other matrix, say n jk , we produce n key-value pairs, each having one of the keys ( i, k ),

for 1 ≤ i ≤ n . In this case, not only is n the average number of key-value pairs produced for

an input element, but each input produces exactly this number of pairs.

We also see that q , the required reducer size, is 2 n . That is, for each key ( i, k ), there are n

key-value pairs representing elements m ij of the first matrix and another n key-value pairs

derived from the elements n jk of the second matrix. While this pair of values represents

Search WWH ::

Custom Search

Home