in the size of the original input, even if the delta in the input is small. In fact, it may be argued that the larger the reduce task, the more likely it is that a part of its input will change. To prevent this stability problem, we need to find a way to control the granularity of the subcomputations in the reduce phase, and to organize these subcomputations in a way that avoids creating a long dependence chain between them; otherwise, a single newly computed subcomputation could trigger a large amount of recomputation.
To reduce the granularity of reduce tasks, we propose a new contraction phase, which is run by reduce tasks. This new phase takes advantage of combiners, a feature of the MapReduce framework [18], also implemented by Hadoop, which originally aims at saving bandwidth by offloading part of the computation performed by the reduce task to the map task. To this end, the programmer specifies a combiner function, which is invoked by the map task and pre-processes a part of the map output, that is, a set of 〈key, value〉 pairs, merging them into a smaller number of pairs. The signature of the combiner function uses the same types for its input and its output, so that it can be interposed between the map and reduce phases: both its input and its output are a sequence of 〈key, value〉 pairs. In all the MapReduce applications we analyzed so far, the combiner and reduce functions perform similar work.
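To make this concrete, the following is a minimal word-count-style combiner written against Hadoop's Reducer API. It is a sketch rather than code from the system described here, and the class and field names are hypothetical; it merges the 〈word, 1〉 pairs emitted by a map task into a single 〈word, partial count〉 pair per word, which is the same work the reduce function performs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word-count combiner; in Hadoop it would be registered on the
// job with job.setCombinerClass(SumCombiner.class).
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();          // merge the partial counts for this key
        }
        result.set(sum);
        context.write(key, result);  // emit one 〈key, value〉 pair per key
    }
}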
The contraction phase uses combiners to break up reduce tasks into several applications of the combine function. In particular, we start by splitting the reduce input into chunks and applying the combine function to each chunk. Then we recursively form chunks from the aggregate result of all the combine invocations and apply the combine function to these new chunks. The data size gets smaller at each level, and in the last level, we apply the reduce function to the output of all the combiners from the second-to-last level.
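The following sketch illustrates, with hypothetical names and outside any MapReduce framework, how such a contraction phase can be organized: the reduce input is split into fixed-size chunks, the combine function is applied to each chunk, and the procedure repeats on the combined output until a single chunk remains, which is then handed to the reduce function.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a contraction phase over a list of 〈key, value〉 pairs
// of some element type T; 'combine' is the user-supplied combiner function.
final class Contraction<T> {
    private final Function<List<T>, List<T>> combine;
    private final int chunkSize;

    Contraction(Function<List<T>, List<T>> combine, int chunkSize) {
        this.combine = combine;
        this.chunkSize = chunkSize;
    }

    List<T> run(List<T> reduceInput) {
        List<T> level = reduceInput;
        while (level.size() > chunkSize) {
            List<T> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += chunkSize) {
                // each chunk is an independent subcomputation whose result
                // can be reused when its inputs do not change
                int end = Math.min(i + chunkSize, level.size());
                next.addAll(combine.apply(level.subList(i, end)));
            }
            level = next;   // the next level is strictly smaller
        }
        return level;       // the reduce function is applied to this output
    }
}

In a sketch organized this way, the subcomputations form a tree whose depth grows only logarithmically with the number of chunks, which matches the goal stated above of avoiding long dependence chains between subcomputations.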
Given the signature of combiner functions we described before, it is syntactically correct to interpose any number of combiner invocations between the map and reduce functions. However, semantically, combiners are invoked by the MapReduce or Hadoop frameworks at most once per key/value pair that is output by a map task, and therefore MapReduce programs are only required to ensure the correctness of the MapReduce computation for a single combiner invocation, that is,
R ∘ C ∘ M = R ∘ M
where R, C, and M represent the reduce, combiner, and map functions, respectively.
Our new use of combiner functions introduces a different requirement, namely,
R ∘ Cⁿ ∘ M = R ∘ M,  ∀ n > 0
It is conceivable to write a combiner that meets the original requirement but not
the new one. However, we found that, in practice, all of the combiner functions we
have seen obey the new requirement.
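For instance, a combiner that sums partial counts in a word-count job satisfies the stronger requirement for every n, since addition is associative. The small, self-contained check below (hypothetical names, not part of any framework) compares R ∘ M with R ∘ C² ∘ M on a toy map output.

import java.util.List;

public class CombinerPropertyCheck {
    // In word count, the combiner and the reduce function do the same work:
    // they sum the values associated with a key.
    static int sum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> mapOutput = List.of(1, 1, 1, 1, 1, 1); // six 〈word, 1〉 pairs

        // R ∘ M: reduce directly over the map output
        int direct = sum(mapOutput);

        // R ∘ C² ∘ M: two levels of combining, then reduce
        List<Integer> level1 = List.of(sum(mapOutput.subList(0, 3)),
                                       sum(mapOutput.subList(3, 6)));
        List<Integer> level2 = List.of(sum(level1));
        int contracted = sum(level2);

        System.out.println(direct == contracted); // prints true: both are 6
    }
}

By the same reasoning, any combiner whose merge operation is associative can be nested to an arbitrary depth without changing the reduce result.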
Stability of the contraction phase. When deciding how to partition the input to
the contraction phase, the same issue that was faced by the map phase arises: if a
part of the input to the contraction phase is removed or a new part is added, then a
fixed-size partitioning of the input would not ensure the stability of the dependence
graph. This problem is illustrated in Figure 4.4, which shows two consecutive runs of