in the size of the original input, even if the delta in the input is small. In fact, it may be argued that the larger the reduce task, the more likely it is that a part of its input will change. To prevent this stability problem, we need to find a way to control the granularity of the subcomputations in the reduce phase, and to organize these subcomputations in a way that avoids creating a long dependence chain between them; otherwise, a single newly computed subcomputation could trigger a large amount of recomputation.
To reduce the granularity of reduce tasks, we propose a new contraction phase, which is run by reduce tasks. This new phase takes advantage of combiners, a feature of the MapReduce framework [18], also implemented by Hadoop, which originally aims at saving bandwidth by offloading part of the computation performed by the reduce task to the map task. To this end, the programmer specifies a combiner function, which is invoked by the map task and pre-processes a part of the map output, that is, a set of 〈key, value〉 pairs, merging them into a smaller number of pairs. The signature of the combiner function uses the same types for its input and its output, so that it can be interposed between the map and reduce phases: both its input and its output are a sequence of 〈key, value〉 pairs. In all the MapReduce applications we analyzed so far, the combiner and reduce functions perform similar work.
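To make this concrete, the following is a minimal word-count-style combiner written against Hadoop's Reducer API. It is a sketch rather than code from the system described here, and the class and field names are hypothetical; it merges the 〈word, 1〉 pairs emitted by a map task into a single 〈word, partial count〉 pair per word, which is the same work the reduce function performs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word-count combiner; in Hadoop it would be registered on the
// job with job.setCombinerClass(SumCombiner.class).
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();          // merge the partial counts for this key
        }
        result.set(sum);
        context.write(key, result);  // emit one 〈key, value〉 pair per key
    }
}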
The contraction phase uses combiners to break up reduce tasks into several applications of the combine function. In particular, we start by splitting the reduce input into chunks and applying the combine function to each chunk. Then we recursively form chunks from the aggregate result of all the combine invocations and apply the combine function to these new chunks. The data size gets smaller at each level, and in the last level, we apply the reduce function to the output of all the combiners from the second-to-last level.
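The following sketch illustrates, with hypothetical names and outside any MapReduce framework, how such a contraction phase can be organized: the reduce input is split into fixed-size chunks, the combine function is applied to each chunk, and the procedure repeats on the combined output until a single chunk remains, which is then handed to the reduce function.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a contraction phase over a list of 〈key, value〉 pairs
// of some element type T; 'combine' is the user-supplied combiner function.
final class Contraction<T> {
    private final Function<List<T>, List<T>> combine;
    private final int chunkSize;

    Contraction(Function<List<T>, List<T>> combine, int chunkSize) {
        this.combine = combine;
        this.chunkSize = chunkSize;
    }

    List<T> run(List<T> reduceInput) {
        List<T> level = reduceInput;
        while (level.size() > chunkSize) {
            List<T> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += chunkSize) {
                // each chunk is an independent subcomputation whose result
                // can be reused when its inputs do not change
                int end = Math.min(i + chunkSize, level.size());
                next.addAll(combine.apply(level.subList(i, end)));
            }
            level = next;   // the next level is strictly smaller
        }
        return level;       // the reduce function is applied to this output
    }
}

In a sketch organized this way, the subcomputations form a tree whose depth grows only logarithmically with the number of chunks, which matches the goal stated above of avoiding long dependence chains between subcomputations.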
Given the signature of combiner functions we described before, it is syntactically correct to interpose any number of combiner invocations between the map and reduce functions. However, semantically, combiners are invoked by the MapReduce or Hadoop frameworks at most once per key/value pair that is output by a map task, and therefore MapReduce programs are only required to ensure the correctness of the MapReduce computation for a single combiner invocation, that is,
R ∘ C ∘ M = R ∘ M
where R, C, and M represent the reduce, combiner, and map functions, respectively.
Our new use of combiner functions introduces a different requirement, namely,
R ∘ Cⁿ ∘ M = R ∘ M,  ∀ n > 0
It is conceivable to write a combiner that meets the original requirement but not
the new one. However, we found that, in practice, all of the combiner functions we
have seen obey the new requirement.
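For instance, a combiner that sums partial counts in a word-count job satisfies the stronger requirement for every n, since addition is associative. The small, self-contained check below (hypothetical names, not part of any framework) compares R ∘ M with R ∘ C² ∘ M on a toy map output.

import java.util.List;

public class CombinerPropertyCheck {
    // In word count, the combiner and the reduce function do the same work:
    // they sum the values associated with a key.
    static int sum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> mapOutput = List.of(1, 1, 1, 1, 1, 1); // six 〈word, 1〉 pairs

        // R ∘ M: reduce directly over the map output
        int direct = sum(mapOutput);

        // R ∘ C² ∘ M: two levels of combining, then reduce
        List<Integer> level1 = List.of(sum(mapOutput.subList(0, 3)),
                                       sum(mapOutput.subList(3, 6)));
        List<Integer> level2 = List.of(sum(level1));
        int contracted = sum(level2);

        System.out.println(direct == contracted); // prints true: both are 6
    }
}

By the same reasoning, any combiner whose merge operation is associative can be nested to an arbitrary depth without changing the reduce result.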
Stability of the contraction phase. When deciding how to partition the input to
the contraction phase, the same issue that was faced by the map phase arises: if a
part of the input to the contraction phase is removed or a new part is added, then a
fixed-size partitioning of the input would not ensure the stability of the dependence
graph. This problem is illustrated in Figure 4.4, which shows two consecutive runs of