memoization techniques for iterative computations. Incoop improves on these proposals by drawing on a set of principles from related work to identify the situations where task-level memoization is inefficient, and overcomes them with techniques such as stable input partitioning and the contraction phase.
Our own short position paper [10] makes the case for applying techniques inspired by self-adjusting computation to large-scale data processing in general, using MapReduce as an example. That paper, however, models MapReduce in CEAL [24], a sequential, single-machine implementation of self-adjusting computation, and does not offer a full-scale distributed design and implementation such as the system we presented.
Finally, the idea of breaking up the work of a reduce task using combiner functions has also been adopted to minimize network bandwidth overheads [17,29].
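As an illustration of how combiners break up reduce work, the following word-count sketch in Python (the function names are our own, not from any of the cited systems) pre-aggregates each map task's output locally before the shuffle, so reducers receive one partial count per word per task instead of every individual occurrence:

```python
from collections import Counter

def map_fn(line):
    # Emit one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: merge a single map task's output on the mapper's machine,
    # shrinking the data that must cross the network to the reducers.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_fn(word, partial_counts):
    return word, sum(partial_counts)

# Two map tasks; each combines locally before the shuffle.
partials = [combine(map_fn("a b a")), combine(map_fn("b b c"))]

# Shuffle: group the partial counts by word across tasks.
grouped = {}
for part in partials:
    for word, n in part:
        grouped.setdefault(word, []).append(n)

result = dict(reduce_fn(w, ns) for w, ns in grouped.items())
assert result == {"a": 2, "b": 3, "c": 1}
```

Because the combiner applies the same associative aggregation as the reducer, the final result is unchanged while the shuffled data volume shrinks; the same decomposition is what lets systems reuse partial reduce results incrementally.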
Stream processing systems. Comet [25] proposed the Batched Stream Processing
(BSP) model, where queries are triggered upon appending a chunk of data to an input
stream. In contrast to Comet, our design is compatible with the MapReduce model and focuses on issues, such as controlling task granularity and input partitioning, that do not arise in Comet's model.
NOVA [32] allows for the incremental execution of Pig programs as new data
continuously arrives. Much like the work on incremental view maintenance, NOVA
introduces a workflow manager that rewrites the computation to identify the parts of
the computation affected by incremental changes and produce the necessary update
function, which in turn runs on top of the existing Pig/Hadoop framework. However,
as noted by the authors of NOVA, an alternative, more efficient design would be to
modify the underlying Hadoop system to support this functionality. This is precisely
the path we took in Incoop. Furthermore, our work is more general, since it can be
applied to any MapReduce program and not only the programs produced by the Pig
framework.
Other systems such as Storm [2] or S4 [1] work at a finer granularity by triggering
a processing routine each time a (possibly small) input arrives. Deduce [26] presents
a hybrid solution for real-time data analytics using a combination of techniques from
batch processing in MapReduce and stream processing in IBM's System S. In contrast to these proposals, the focus of Incoop is to provide incremental processing for batch workloads.
4.8 CONCLUSION
This chapter presents a set of principles and techniques for performing incremental
MapReduce computations. We build on prior work from several fields, most notably
contributions from the programming languages community on self-adjusting computation, a technique for incremental computation, which we extend to large-scale parallel computations. The resulting system combines the efficiency of recomputing
a small subset of subcomputations affected by each input change with the ability to
transparently execute existing MapReduce computations. This work thus has the
potential to bring substantial performance improvements to MapReduce computa-
tions when these computations are repeated on evolving data.