even partially, by using a computation memoization technique that remembers
(and reuses) not just input-output relationships but also the dependency graphs of
memoized subcomputations [5].
The efficiency of a self-adjusting computation in responding to an input modifica-
tion is determined by the stability of the computation. Informally speaking, we call
a computation stable when the sets of subcomputations performed on similar input
data are themselves similar; that is, many of the subcomputations are in fact the
same and thus can be reused. For a more precise definition of stability, we refer the
same and thus can be reused. For a more precise definition of stability, we refer the
interested reader to the previous work [3,27]. One way to ensure stability is to make
sure that (1) the computation is divided into small subcomputations and (2) no long
chain of dependencies exists between computations. Since the MapReduce framework
is inherently parallel, it naturally has short dependency chains at the granularity of
tasks. As we will see, however, it can yield unstable computations, because a small
change to the input can cause the input for many MapReduce tasks to change, and
because reduce tasks can be large.
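To make this instability concrete, the following minimal Python sketch (the fixed-size splitting, the memo table, and the stand-in map function are hypothetical illustrations, not Incoop's actual mechanism) memoizes map-task outputs keyed by a hash of each input split. Prepending a single record shifts every fixed-size split boundary, so no memoized task can be reused:

```python
import hashlib

def splits(records, size=3):
    """Partition the input into fixed-size splits, as a framework might."""
    return [tuple(records[i:i + size]) for i in range(0, len(records), size)]

memo = {}  # split hash -> memoized map-task output

def run_map_tasks(records):
    """Run one map task per split; return how many tasks were reused."""
    reused = 0
    for split in splits(records):
        key = hashlib.sha256(repr(split).encode()).hexdigest()
        if key in memo:
            reused += 1                              # memoized result reused
        else:
            memo[key] = [r.upper() for r in split]   # stand-in map function
    return reused

data = ["a", "b", "c", "d", "e", "f"]
run_map_tasks(data)                  # cold run: nothing to reuse
print(run_map_tasks(["x"] + data))   # -> 0: one new record shifts every split
```

A single prepended record changes the contents (and hence the hash) of every split, so the computation is unstable even though each individual map task is small.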
For the sake of simplicity in design and implementation, Incoop does not con-
struct the dependency graph of subcomputations explicitly, and thus does not perform
change propagation on the dependency graph. Instead, the graph is recorded implic-
itly by memoizing subcomputations—MapReduce tasks—and change propagation
is performed by revisiting all subcomputations and reusing those that can be reused
via memoization. While this approach simplifies the design and the implementation,
it can yield asymptotically suboptimal performance, because it requires touching all
subcomputations (for the purposes of memoization and reuse) even if they may not
be affected by the input modifications. Since, however, subcomputations (tasks) are
relatively large, the cost of reusing an unaffected subcomputation is small compared
with rebuilding it. The approach therefore can perform well in practice (Section 4.6).
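The change-propagation strategy described above can be sketched as a loop that revisits every task and recomputes only those whose inputs changed. This is a simplified illustration, not Incoop's implementation; the `propagate` and `task_key` helpers are hypothetical names:

```python
import hashlib

memo = {}  # (task id, input hash) -> memoized task output

def task_key(task_id, task_input):
    return (task_id, hashlib.sha256(repr(task_input).encode()).hexdigest())

def propagate(job_inputs, run_task):
    """Revisit all tasks; rebuild only those whose input hash is unseen."""
    outputs, recomputed = [], 0
    for task_id, task_input in sorted(job_inputs.items()):
        key = task_key(task_id, task_input)
        if key not in memo:
            memo[key] = run_task(task_input)  # affected: rebuild the task
            recomputed += 1
        outputs.append(memo[key])             # unaffected: cheap lookup
    return outputs, recomputed

run = lambda xs: sum(xs)
_, n1 = propagate({"t1": [1, 2], "t2": [3, 4]}, run)  # n1 == 2: cold run
_, n2 = propagate({"t1": [1, 2], "t2": [3, 5]}, run)  # n2 == 1: only t2 changed
```

Note that `propagate` still touches every task even when `recomputed` is zero, which is exactly the asymptotic suboptimality discussed above; the lookup cost per unaffected task is small relative to rerunning it.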
4.2.2 Basic Design
Our goal is to design a system for large-scale incremental data processing that is
able to leverage the performance benefits of incremental computation, while also
being transparent, meaning that it does not require changes to existing programs. In
particular, we consider MapReduce programs for distributed large-scale data pro-
cessing. We assume the reader is familiar with the MapReduce paradigm, and refer
to prior publications on the subject for more information [19].
To achieve this goal, we apply the principles of self-adjusting computation to the
MapReduce paradigm. To this end, we first need to decide what forms a subcompu-
tation. The natural candidate in our system is to use MapReduce tasks as subcom-
putations; this makes it possible to view the data-flow graph of the MapReduce job
as a subgraph of the dependency graph, which in addition has control dependencies.
Since MapReduce frameworks implicitly keep track of this graph when implement-
ing the data movement and synchronization between the various tasks, building the
dependency graph becomes easy and natural.
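As a small illustration of the graph the framework already tracks (a hypothetical sketch; the function name and the assumption that every map task may feed every reduce partition are ours), the dataflow edges of a job with hash-partitioned shuffle form a complete bipartite subgraph between map and reduce tasks:

```python
def dependency_edges(num_maps, num_reduces):
    """Dataflow edges implicit in a MapReduce job: each map task may emit
    keys for every reduce partition, so every map feeds every reducer."""
    return [(f"map-{m}", f"reduce-{r}")
            for m in range(num_maps)
            for r in range(num_reduces)]

edges = dependency_edges(2, 2)
# [('map-0', 'reduce-0'), ('map-0', 'reduce-1'),
#  ('map-1', 'reduce-0'), ('map-1', 'reduce-1')]
```

The full dependency graph adds control dependencies on top of these dataflow edges, but since the scheduler already materializes the dataflow when moving and synchronizing data, no separate bookkeeping is needed to recover it.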
This decision leads to our basic design, which is shown in Figure 4.1. In this design,
the MapReduce scheduler orchestrates the execution of every MapReduce job nor-
mally, by spawning and synchronizing tasks and performing data movement as in a