even partially, by using a computation memoization technique that remembers
(and reuses) not just input-output relationships but also the dependency graphs of
memoized subcomputations [5].
The efficiency of a self-adjusting computation in responding to an input modifica-
tion is determined by the stability of the computation. Informally speaking, we call
a computation stable when the sets of subcomputations performed on similar input
data are themselves similar; that is, many of the subcomputations are in fact the
same and thus can be reused. For a more precise definition of stability, we refer the
same and thus can be reused. For a more precise definition of stability, we refer the
interested reader to the previous work [3,27]. One way to ensure stability is to make
sure that (1) the computation is divided into small subcomputations and (2) no long
chain of dependencies exists between computations. Since the MapReduce framework
is inherently parallel, it naturally has short dependency chains at the granularity of
tasks. As we will see, however, it can yield unstable computations, because a small
change to the input can cause the input for many MapReduce tasks to change, and
because reduce tasks can be large.
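To make this instability concrete, the following minimal Python sketch (the fixed-size splitting, the memo table, and the stand-in map function are hypothetical illustrations, not Incoop's actual mechanism) memoizes map-task outputs keyed by a hash of each input split. Prepending a single record shifts every fixed-size split boundary, so no memoized task can be reused:

```python
import hashlib

def splits(records, size=3):
    """Partition the input into fixed-size splits, as a framework might."""
    return [tuple(records[i:i + size]) for i in range(0, len(records), size)]

memo = {}  # split hash -> memoized map-task output

def run_map_tasks(records):
    """Run one map task per split; return how many tasks were reused."""
    reused = 0
    for split in splits(records):
        key = hashlib.sha256(repr(split).encode()).hexdigest()
        if key in memo:
            reused += 1                              # memoized result reused
        else:
            memo[key] = [r.upper() for r in split]   # stand-in map function
    return reused

data = ["a", "b", "c", "d", "e", "f"]
run_map_tasks(data)                  # cold run: nothing to reuse
print(run_map_tasks(["x"] + data))   # -> 0: one new record shifts every split
```

A single prepended record changes the contents (and hence the hash) of every split, so the computation is unstable even though each individual map task is small.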
For the sake of simplicity in design and implementation, Incoop does not con-
struct the dependency graph of subcomputations explicitly, and thus does not perform
change propagation on the dependency graph. Instead, the graph is recorded implic-
itly by memoizing subcomputations—MapReduce tasks—and change propagation
is performed by revisiting all subcomputations and reusing those that can be reused
via memoization. While this approach simplifies the design and the implementation,
it can yield asymptotically suboptimal performance, because it requires touching all
subcomputations (for the purposes of memoization and reuse) even if they may not
be affected by the input modifications. Since, however, subcomputations (tasks) are
relatively large, the cost of reusing an unaffected subcomputation is small compared
with rebuilding it. The approach therefore can perform well in practice (Section 4.6).
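The change-propagation strategy described above can be sketched as a loop that revisits every task and recomputes only those whose inputs changed. This is a simplified illustration, not Incoop's implementation; the `propagate` and `task_key` helpers are hypothetical names:

```python
import hashlib

memo = {}  # (task id, input hash) -> memoized task output

def task_key(task_id, task_input):
    return (task_id, hashlib.sha256(repr(task_input).encode()).hexdigest())

def propagate(job_inputs, run_task):
    """Revisit all tasks; rebuild only those whose input hash is unseen."""
    outputs, recomputed = [], 0
    for task_id, task_input in sorted(job_inputs.items()):
        key = task_key(task_id, task_input)
        if key not in memo:
            memo[key] = run_task(task_input)  # affected: rebuild the task
            recomputed += 1
        outputs.append(memo[key])             # unaffected: cheap lookup
    return outputs, recomputed

run = lambda xs: sum(xs)
_, n1 = propagate({"t1": [1, 2], "t2": [3, 4]}, run)  # n1 == 2: cold run
_, n2 = propagate({"t1": [1, 2], "t2": [3, 5]}, run)  # n2 == 1: only t2 changed
```

Note that `propagate` still touches every task even when `recomputed` is zero, which is exactly the asymptotic suboptimality discussed above; the lookup cost per unaffected task is small relative to rerunning it.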
4.2.2 Basic Design
Our goal is to design a system for large-scale incremental data processing that is
able to leverage the performance benefits of incremental computation, while also
being transparent, meaning that it does not require changes to existing programs. In
particular, we consider MapReduce programs for distributed large-scale data pro-
cessing. We assume the reader is familiar with the MapReduce paradigm, and refer
to prior publications on the subject for more information [19].
To achieve this goal, we apply the principles of self-adjusting computation to the
MapReduce paradigm. To this end, we first need to decide what forms a subcompu-
tation. The natural candidate in our system is to use MapReduce tasks as subcom-
putations; this makes it possible to view the data-flow graph of the MapReduce job
as a subgraph of the dependency graph, which in addition has control dependencies.
Since MapReduce frameworks implicitly keep track of this graph when implement-
ing the data movement and synchronization between the various tasks, building the
dependency graph becomes easy and natural.
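As a small illustration of the graph the framework already tracks (a hypothetical sketch; the function name and the assumption that every map task may feed every reduce partition are ours), the dataflow edges of a job with hash-partitioned shuffle form a complete bipartite subgraph between map and reduce tasks:

```python
def dependency_edges(num_maps, num_reduces):
    """Dataflow edges implicit in a MapReduce job: each map task may emit
    keys for every reduce partition, so every map feeds every reducer."""
    return [(f"map-{m}", f"reduce-{r}")
            for m in range(num_maps)
            for r in range(num_reduces)]

edges = dependency_edges(2, 2)
# [('map-0', 'reduce-0'), ('map-0', 'reduce-1'),
#  ('map-1', 'reduce-0'), ('map-1', 'reduce-1')]
```

The full dependency graph adds control dependencies on top of these dataflow edges, but since the scheduler already materializes the dataflow when moving and synchronizing data, no separate bookkeeping is needed to recover it.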
This decision leads to our basic design, which is shown in Figure 4.1. In this design,
the MapReduce scheduler orchestrates the execution of every MapReduce job nor-
mally, by spawning and synchronizing tasks and performing data movement as in a