framework called the contraction phase. This consists of breaking up the
Reduce task into smaller subcomputations that form an inverted tree, such
that, when a small portion of the input changes, only the path from the
corresponding leaf to the root needs to be recomputed.
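To make the contraction idea concrete, the following is a hypothetical sketch (not Incoop's implementation) that models an associative reduce, here a sum, as a balanced inverted tree of subcomputations. Changing one leaf triggers recomputation of only the O(log n) nodes on the path to the root; every other node's result is reused.

```python
# Hypothetical sketch: an associative reduce evaluated as an inverted tree.
# Only the path from a changed leaf to the root is recomputed on an update.

def build_tree(leaves):
    """Bottom-up construction: each level pairs up results from the level below."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([sum(below[i:i + 2]) for i in range(0, len(below), 2)])
    return levels  # levels[-1][0] is the final reduce result

def update_leaf(levels, index, new_value):
    """Incremental update: recompute only the leaf-to-root path."""
    levels[0][index] = new_value
    recomputed = 0
    for lvl in range(1, len(levels)):
        index //= 2
        lo = 2 * index
        levels[lvl][index] = sum(levels[lvl - 1][lo:lo + 2])
        recomputed += 1
    return levels[-1][0], recomputed

levels = build_tree([1, 2, 3, 4, 5, 6, 7, 8])  # initial run: full reduce
total, work = update_leaf(levels, 2, 30)       # one input record changes
# Only 3 internal nodes (log2 of 8) are rebuilt instead of all 7.
```

With a flat reduce, any change would force the entire reduction to rerun; the tree shape is what bounds the recomputation to a single root path.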
Memoization-aware scheduler. We modify the scheduler of Hadoop to
take advantage of the locality of memoized results. The new scheduler uses
a work stealing strategy to decrease the amount of data movement across
machines when reusing memoized outputs, while still allowing tasks to
execute on machines that are available.
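The scheduling policy can be sketched as follows. This is an illustrative model, not Incoop's actual scheduler code: tasks are queued on the machine that holds their memoized output so repeated runs avoid data movement, and an idle machine steals from the most loaded queue so that locality never leaves machines idle.

```python
# Hypothetical sketch of a memoization-aware, work-stealing scheduler.
# All names and the stealing policy are illustrative assumptions.
from collections import deque

def schedule(tasks, memo_location, machines):
    # Queue each task on the machine holding its memoized result, if any.
    queues = {m: deque() for m in machines}
    for task in tasks:
        home = memo_location.get(task, machines[0])
        queues[home].append(task)
    assignment = {m: [] for m in machines}
    while any(queues.values()):
        for m in machines:
            if queues[m]:
                # Prefer a local task: its memoized output is already here.
                assignment[m].append(queues[m].popleft())
            else:
                # Work stealing: take a task from the most loaded machine
                # rather than letting this machine sit idle.
                victim = max(queues, key=lambda v: len(queues[v]))
                if queues[victim]:
                    assignment[m].append(queues[victim].pop())
    return assignment

assignment = schedule(['t1', 't2', 't3', 't4'],
                      {'t1': 'A', 't2': 'A', 't3': 'A', 't4': 'B'},
                      ['A', 'B'])
# 'A' runs its memoized tasks t1 and t2 locally; idle 'B' steals t3.
```

The trade-off mirrors the text: strict locality alone would leave machine B idle after one task, while pure load balancing would move memoized data; stealing only on idleness gets both.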
We present an overview of the design and implementation of Incoop, and we
report experimental results using five MapReduce applications. The results from this
evaluation show that we achieve significant performance gains, while paying a modest
cost during the initial run or during runs that cannot take advantage of previously
computed results.
The rest of this chapter is organized as follows. We first present an overview of
Incoop in Section 4.2. The system design is detailed in Sections 4.3, 4.4, and 4.5. We
present the experimental evaluation in Section 4.6. Related work and conclusions are
discussed in Sections 4.7 and 4.8, respectively.
4.2 SYSTEM OVERVIEW
We first present a basic design that we use as a starting point, highlight the
limitations of this basic design and the challenges in overcoming them, and briefly
overview the main ideas behind Incoop, which addresses these limitations.
Our basic strategy is to adapt the principles of self-adjusting computation to the
MapReduce paradigm, and in particular to Hadoop. We start with some background
on self-adjusting computation.
4.2.1 Self-Adjusting Computation
Self-adjusting computation [3,5,6,15,24] offers a solution to the incremental-computation
problem by enabling any computation to respond to changes in its data by efficiently
recomputing only the subcomputations that are affected by the changes. To this end,
a self-adjusting computation tracks dependencies between the inputs and the outputs
of subcomputations, and in incremental runs, only rebuilds subcomputations
affected (transitively) by modified inputs. To identify the affected subcomputations,
the approach represents a computation as a dependency graph of subcomputations,
where two subcomputations are data-dependent if one of them uses the output of the
other as input and control-dependent if one takes place within the dynamic scope of
another. Subcomputations are also memoized based on their inputs to enable reuse
even if they are control-dependent on some affected subcomputation. Given the
“delta,” the modifications to the input, a change-propagation algorithm pushes the
modifications through the dependency graph, rebuilding affected subcomputations,
which it identifies based on both data and control dependencies. Before rebuilding a
subcomputation, change propagation recovers subcomputations that can be reused,
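The memoization-based reuse at the heart of this process can be sketched as follows. This is a simplified, hypothetical model (the graph and names are illustrative): each subcomputation is keyed on its inputs, so after a delta, only subcomputations whose inputs changed are rebuilt, while the rest are recovered from the memo. Real change propagation additionally avoids even visiting unaffected nodes.

```python
# Minimal sketch of memoized subcomputation reuse after an input delta.
# The dependency graph and all names here are illustrative assumptions.

def evaluate(order, inputs, memo, stats):
    """Run subcomputations in dependency order, reusing memoized results."""
    values = dict(inputs)
    for name, deps, fn in order:
        key = (name, tuple(values[d] for d in deps))
        if key in memo:                       # unaffected: reuse memoized output
            stats['reused'] += 1
        else:                                 # affected: rebuild and memoize
            memo[key] = fn(*(values[d] for d in deps))
            stats['rebuilt'] += 1
        values[name] = memo[key]
    return values

# Dependency graph: f reads inputs a and b, g reads only b, h reads f and g.
order = [('f', ['a', 'b'], lambda a, b: a + b),
         ('g', ['b'], lambda b: b * 10),
         ('h', ['f', 'g'], lambda f, g: f + g)]
memo = {}
evaluate(order, {'a': 1, 'b': 2}, memo, {'rebuilt': 0, 'reused': 0})
stats = {'rebuilt': 0, 'reused': 0}
values = evaluate(order, {'a': 5, 'b': 2}, memo, stats)  # delta: a changed
# g is reused, since b is unchanged; only f and h, which (transitively)
# depend on a, are rebuilt.
```

Note how g is reused even though it runs between two rebuilt subcomputations, which is the role memoization plays alongside the dependency graph in the text above.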