Database Reference
In-Depth Information
sire to shrink the wall-clock time and to execute each reducer in main memory. Recall that
a “reducer” is the execution of the Reduce function on a single key and its associated value
list. The point of the exploration in this section is that for many problems there is a spec-
trum of MapReduce algorithms requiring different amounts of communication. Moreover,
the less communication an algorithm uses, the worse it may be in other respects, including
wall-clock time and the amount of main memory it requires.
2.6.1
Reducer Size and Replication Rate
Let us now introduce the two parameters that characterize families of MapReduce al-
gorithms. The first is the reducer size , which we denote by q . This parameter is the upper
bound on the number of values that are allowed to appear in the list associated with a single
key. Reducer size can be selected with at least two goals in mind:
(1) By making the reducer size small, we can force there to be many reducers, i.e., many
different keys according to which the problem input is divided by the Map tasks. If we
also create many Reduce tasks - even one for each reducer - then there will be a high
degree of parallelism, and we can look forward to a low wall-clock time.
(2) We can choose a reducer size sufficiently small that we are certain the computation
associated with a single reducer can be executed entirely in the main memory of the
compute node where its Reduce task is located. Regardless of the computation done
by the reducers, the running time will be greatly reduced if we can avoid having to
move data repeatedly between main memory and disk.
The second parameter is the replication rate , denoted r . We define r to be the number
of key-value pairs produced by all the Map tasks on all the inputs, divided by the number
of inputs. That is, the replication rate is the average communication from Map tasks to Re-
duce tasks (measured by counting key-value pairs) per input.
EXAMPLE 2.11 Let us consider the one-pass matrix-multiplication algorithm of Section
2.3.10 . Suppose that all the matrices involved are n × n matrices. Then the replication rate r
is equal to n . That fact is easy to see, since for each element m ij , there are n key-value pairs
produced; these have all keys of the form ( i, k ), for 1 ≤ k n . Likewise, for each element of
the other matrix, say n jk , we produce n key-value pairs, each having one of the keys ( i, k ),
for 1 ≤ i n . In this case, not only is n the average number of key-value pairs produced for
an input element, but each input produces exactly this number of pairs.
We also see that q , the required reducer size, is 2 n . That is, for each key ( i, k ), there are n
key-value pairs representing elements m ij of the first matrix and another n key-value pairs
derived from the elements n jk of the second matrix. While this pair of values represents
Search WWH ::




Custom Search