framework called the contraction phase. This consists of breaking up the
Reduce task into smaller subcomputations that form an inverted tree, such
that, when a small portion of the input changes, only the path from the
corresponding leaf to the root needs to be recomputed.
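To make the contraction idea concrete, the following is a hypothetical sketch (not Incoop's implementation) that models an associative reduce, here a sum, as a balanced inverted tree of subcomputations. Changing one leaf triggers recomputation of only the O(log n) nodes on the path to the root; every other node's result is reused.

```python
# Hypothetical sketch: an associative reduce evaluated as an inverted tree.
# Only the path from a changed leaf to the root is recomputed on an update.

def build_tree(leaves):
    """Bottom-up construction: each level pairs up results from the level below."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([sum(below[i:i + 2]) for i in range(0, len(below), 2)])
    return levels  # levels[-1][0] is the final reduce result

def update_leaf(levels, index, new_value):
    """Incremental update: recompute only the leaf-to-root path."""
    levels[0][index] = new_value
    recomputed = 0
    for lvl in range(1, len(levels)):
        index //= 2
        lo = 2 * index
        levels[lvl][index] = sum(levels[lvl - 1][lo:lo + 2])
        recomputed += 1
    return levels[-1][0], recomputed

levels = build_tree([1, 2, 3, 4, 5, 6, 7, 8])  # initial run: full reduce
total, work = update_leaf(levels, 2, 30)       # one input record changes
# Only 3 internal nodes (log2 of 8) are rebuilt instead of all 7.
```

With a flat reduce, any change would force the entire reduction to rerun; the tree shape is what bounds the recomputation to a single root path.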
Memoization-aware scheduler. We modify the scheduler of Hadoop to
take advantage of the locality of memoized results. The new scheduler uses
a work stealing strategy to decrease the amount of data movement across
machines when reusing memoized outputs, while still allowing tasks to
execute on machines that are available.
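The scheduling policy can be sketched as follows. This is an illustrative model, not Incoop's actual scheduler code: tasks are queued on the machine that holds their memoized output so repeated runs avoid data movement, and an idle machine steals from the most loaded queue so that locality never leaves machines idle.

```python
# Hypothetical sketch of a memoization-aware, work-stealing scheduler.
# All names and the stealing policy are illustrative assumptions.
from collections import deque

def schedule(tasks, memo_location, machines):
    # Queue each task on the machine holding its memoized result, if any.
    queues = {m: deque() for m in machines}
    for task in tasks:
        home = memo_location.get(task, machines[0])
        queues[home].append(task)
    assignment = {m: [] for m in machines}
    while any(queues.values()):
        for m in machines:
            if queues[m]:
                # Prefer a local task: its memoized output is already here.
                assignment[m].append(queues[m].popleft())
            else:
                # Work stealing: take a task from the most loaded machine
                # rather than letting this machine sit idle.
                victim = max(queues, key=lambda v: len(queues[v]))
                if queues[victim]:
                    assignment[m].append(queues[victim].pop())
    return assignment

assignment = schedule(['t1', 't2', 't3', 't4'],
                      {'t1': 'A', 't2': 'A', 't3': 'A', 't4': 'B'},
                      ['A', 'B'])
# 'A' runs its memoized tasks t1 and t2 locally; idle 'B' steals t3.
```

The trade-off mirrors the text: strict locality alone would leave machine B idle after one task, while pure load balancing would move memoized data; stealing only on idleness gets both.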
We present an overview of the design and implementation of Incoop, and we
report experimental results using five MapReduce applications. The results from this
evaluation show that we achieve significant performance gains, while paying a modest
cost during the initial run or during runs that cannot take advantage of previously
computed results.
The rest of this chapter is organized as follows. We first present an overview of
Incoop in Section 4.2. The system design is detailed in Sections 4.3, 4.4, and 4.5. We
present the experimental evaluation in Section 4.6. Related work and conclusions are
discussed in Sections 4.7 and 4.8, respectively.
4.2 SYSTEM OVERVIEW
We first present a basic design that we use as a starting point, highlight the
limitations of this basic design and the challenges in overcoming them, and briefly
overview the main ideas behind Incoop, which addresses these limitations.
Our basic strategy is to adapt the principles of self-adjusting computation to the
MapReduce paradigm, and in particular to Hadoop. We start with some background
on self-adjusting computation.
4.2.1 Self-Adjusting Computation
Self-adjusting computation [3,5,6,15,24] offers a solution to the incremental-computation
problem by enabling any computation to respond to changes in its data by efficiently
recomputing only the subcomputations that are affected by the changes. To this end,
a self-adjusting computation tracks dependencies between the inputs and the outputs
of subcomputations, and in incremental runs, only rebuilds subcomputations
affected (transitively) by modified inputs. To identify the affected subcomputations,
the approach represents a computation as a dependency graph of subcomputations,
where two subcomputations are data-dependent if one of them uses the output of the
other as input and control-dependent if one takes place within the dynamic scope of
another. Subcomputations are also memoized based on their inputs to enable reuse
even if they are control-dependent on some affected subcomputation. Given the
“delta,” the modifications to the input, a change-propagation algorithm pushes the
modifications through the dependency graph, rebuilding affected subcomputations,
which it identifies based on both data and control dependencies. Before rebuilding a
subcomputation, change propagation recovers subcomputations that can be reused,
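The memoization-based reuse at the heart of this process can be sketched as follows. This is a simplified, hypothetical model (the graph and names are illustrative): each subcomputation is keyed on its inputs, so after a delta, only subcomputations whose inputs changed are rebuilt, while the rest are recovered from the memo. Real change propagation additionally avoids even visiting unaffected nodes.

```python
# Minimal sketch of memoized subcomputation reuse after an input delta.
# The dependency graph and all names here are illustrative assumptions.

def evaluate(order, inputs, memo, stats):
    """Run subcomputations in dependency order, reusing memoized results."""
    values = dict(inputs)
    for name, deps, fn in order:
        key = (name, tuple(values[d] for d in deps))
        if key in memo:                       # unaffected: reuse memoized output
            stats['reused'] += 1
        else:                                 # affected: rebuild and memoize
            memo[key] = fn(*(values[d] for d in deps))
            stats['rebuilt'] += 1
        values[name] = memo[key]
    return values

# Dependency graph: f reads inputs a and b, g reads only b, h reads f and g.
order = [('f', ['a', 'b'], lambda a, b: a + b),
         ('g', ['b'], lambda b: b * 10),
         ('h', ['f', 'g'], lambda f, g: f + g)]
memo = {}
evaluate(order, {'a': 1, 'b': 2}, memo, {'rebuilt': 0, 'reused': 0})
stats = {'rebuilt': 0, 'reused': 0}
values = evaluate(order, {'a': 5, 'b': 2}, memo, stats)  # delta: a changed
# g is reused, since b is unchanged; only f and h, which (transitively)
# depend on a, are rebuilt.
```

Note how g is reused even though it runs between two rebuilt subcomputations, which is the role memoization plays alongside the dependency graph in the text above.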