memoization techniques for iterative computations. Incoop improves on these proposals by drawing on a set of principles from related work to identify the situations where task-level memoization is inefficient, and overcomes them with techniques such as stable input partitioning and the contraction phase.
Our own short position paper [10] makes the case for applying techniques inspired by self-adjusting computation to large-scale data processing in general, using MapReduce as an example. That paper, however, models MapReduce in CEAL [24], a sequential, single-machine implementation of self-adjusting computation, and does not offer a full-scale distributed design and implementation such as the system we presented.
Finally, the idea of breaking up the work of a reduce task using combiner functions has also been adopted to minimize network bandwidth overheads [17,29].
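As an illustration of how combiners break up reduce work, the following word-count sketch in Python (the function names are our own, not from any of the cited systems) pre-aggregates each map task's output locally before the shuffle, so reducers receive one partial count per word per task instead of every individual occurrence:

```python
from collections import Counter

def map_fn(line):
    # Emit one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: merge a single map task's output on the mapper's machine,
    # shrinking the data that must cross the network to the reducers.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_fn(word, partial_counts):
    return word, sum(partial_counts)

# Two map tasks; each combines locally before the shuffle.
partials = [combine(map_fn("a b a")), combine(map_fn("b b c"))]

# Shuffle: group the partial counts by word across tasks.
grouped = {}
for part in partials:
    for word, n in part:
        grouped.setdefault(word, []).append(n)

result = dict(reduce_fn(w, ns) for w, ns in grouped.items())
assert result == {"a": 2, "b": 3, "c": 1}
```

Because the combiner applies the same associative aggregation as the reducer, the final result is unchanged while the shuffled data volume shrinks; the same decomposition is what lets systems reuse partial reduce results incrementally.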
Stream processing systems. Comet [25] proposed the Batched Stream Processing
(BSP) model, where queries are triggered upon appending a chunk of data to an input
stream. In contrast to Comet, our design is compatible with the MapReduce model and focuses on issues, such as controlling task granularity and input partitioning, that do not arise in Comet's model.
NOVA [32] allows for the incremental execution of Pig programs as new data
continuously arrives. Much like the work on incremental view maintenance, NOVA
introduces a workflow manager that rewrites the computation to identify the parts of
the computation affected by incremental changes and produce the necessary update
function, which in turn runs on top of the existing Pig/Hadoop framework. However,
as noted by the authors of NOVA, an alternative, more efficient design would be to
modify the underlying Hadoop system to support this functionality. This is precisely
the path we took in Incoop. Furthermore, our work is more general, since it can be
applied to any MapReduce program and not only the programs produced by the Pig
framework.
Other systems such as Storm [2] or S4 [1] work at a finer granularity by triggering
a processing routine each time a (possibly small) input arrives. Deduce [26] presents
a hybrid solution for real-time data analytics using a combination of techniques from
batch processing in MapReduce and stream processing in IBM's System S. In contrast to these proposals, the focus of Incoop is to provide incremental processing for batch workloads.
4.8 CONCLUSION
This chapter presents a set of principles and techniques for performing incremental
MapReduce computations. We build on prior work from several fields, most notably
contributions from the programming languages community on self-adjusting computation, a technique for incremental computation, which we extend to large-scale parallel computations. The resulting system combines the efficiency of recomputing
a small subset of subcomputations affected by each input change with the ability to
transparently execute existing MapReduce computations. This work thus has the
potential to bring substantial performance improvements to MapReduce computa-
tions when these computations are repeated on evolving data.