run to the next. For instance, researchers have reported that the ratio between old and
new data when processing consecutive web crawls may range from 10× to 1000× [28].
Motivated by this observation, there have been several proposals for large-scale
incremental data-processing systems, such as Percolator [33] or CBP [28], to name a
few early and prominent examples. In these systems, the programmer is able to devise
an incremental update handler, which can store state across successive runs, and con-
tains the logic to update the output as the program is notified about input changes.
While this approach allows for significant improvements when compared with the
“single-shot” approach, that is, re-processing all the data whenever part of the
input changes or inputs are added or deleted, it also has the downside of requir-
ing programmers to adopt a new programming model and API. This has two nega-
tive implications. First, there is the programming effort to port a large set of existing
applications to the new programming model. Second, it is often difficult to devise the
logic for incrementally updating the output as the input changes: research in the area
of dynamic algorithms (i.e., algorithms to solve problems that are formulated in terms
of deltas to their input) shows that such algorithms can be very complex, even in cases
where the normal, non-incremental algorithm was easy to devise [16,20].
In this chapter, we present an overview of our prior work on designing and build-
ing a system called Incoop for large-scale incremental computations [11]. Incoop
extends the Hadoop open-source implementation of the MapReduce paradigm to
run unmodified MapReduce programs in an incremental way. The design and imple-
mentation of Incoop is inspired by recent advances in self-adjusting computation
[3,5,6,15,24], which offers a solution to the problem of automatic incrementaliza-
tion of programs, and draws on techniques developed in that line of work. The idea
behind Incoop is to enable the programmer to automatically incrementalize existing
MapReduce programs without making any modifications to the code. To
this end, Incoop records information about previously executed MapReduce tasks so
that this information can be reused in future MapReduce computations when possible.
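To make this reuse concrete, the sketch below shows task-level memoization keyed by a fingerprint of each task's input. It is a minimal illustration under assumed names (TaskMemoizer, runOrReuse) and does not reproduce Incoop's actual implementation, which persists memoized results across Hadoop jobs rather than in memory.

import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative task-level memoization: outputs of previously executed tasks
// are stored under a fingerprint of their input, and an incremental run
// reuses the stored output whenever the fingerprint is unchanged.
public class TaskMemoizer {
    // fingerprint of a task's input -> previously computed output
    private final Map<String, String> memo = new ConcurrentHashMap<>();

    // Hash the raw bytes that a task would consume (e.g., one input chunk).
    static String fingerprint(byte[] taskInput) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(taskInput);
        return HexFormat.of().formatHex(digest);
    }

    // Execute the task only if no memoized output exists for identical input.
    public String runOrReuse(byte[] taskInput, Function<byte[], String> task)
            throws Exception {
        return memo.computeIfAbsent(fingerprint(taskInput), k -> task.apply(taskInput));
    }
}

The control flow mirrors what the rest of this section describes: compute a fingerprint of a task's input, and skip re-executing the task when that fingerprint matches one seen in a previous run.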
The basic approach taken by Incoop consists of (1) splitting the computation into
subcomputations, where the natural candidate for a subcomputation is a MapReduce
task; (2) memoizing the inputs and outputs of each subcomputation; and (3) in an
incremental run, checking the inputs to a subcomputation and using the memoized
output without rerunning the task when the input remains unchanged. Despite being
a good starting point, this basic approach has several shortcomings that motivated us
to introduce a number of technical innovations in Incoop, namely:
Incremental HDFS. We introduce a file system called Inc-HDFS that
provides a scalable way of identifying the deltas in the inputs of two con-
secutive job runs. This reuses an idea from the LBFS file system [31],
which is to avoid splitting the input into fixed-size chunks and instead split
it based on its contents, such that small changes to the input keep most
chunk boundaries stable (a simplified sketch of this content-based chunking
follows this list item). The new file system is able to achieve a large reuse of
input chunks while maintaining compatibility with HDFS, which is the
most common interface to provide the input to a job in Hadoop.
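The following sketch illustrates the kind of content-based chunking Inc-HDFS relies on; the gear-style rolling hash, the fixed mask, and the class name ContentChunker are assumptions made for the example, not the exact fingerprinting scheme used by Inc-HDFS.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Content-defined chunking: a boundary is declared whenever a rolling hash
// over the most recent bytes matches a bit pattern, so an insertion or
// deletion only moves the boundaries close to the edit.
public class ContentChunker {
    static final long[] GEAR = new long[256];    // per-byte random values
    static {
        Random rnd = new Random(42);             // fixed seed for determinism
        for (int i = 0; i < 256; i++) GEAR[i] = rnd.nextLong();
    }
    static final long MASK = (1 << 13) - 1;      // ~8 KiB expected chunk size

    // Returns the offsets (exclusive) at which chunks end.
    static List<Integer> chunkBoundaries(byte[] data) {
        List<Integer> ends = new ArrayList<>();
        long h = 0;
        for (int i = 0; i < data.length; i++) {
            h = (h << 1) + GEAR[data[i] & 0xFF]; // gear-style rolling hash
            if ((h & MASK) == MASK) {            // boundary condition hit
                ends.add(i + 1);
                h = 0;                           // start hashing the next chunk
            }
        }
        if (ends.isEmpty() || ends.get(ends.size() - 1) != data.length)
            ends.add(data.length);               // trailing partial chunk
        return ends;
    }
}

Because the masked hash bits depend only on the most recently read bytes, chunks before and after a localized edit keep their old boundaries and therefore their old fingerprints, which is what allows the memoized results of the corresponding tasks to be reused.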
Contraction phase. To avoid rerunning a large reduce task when only a
small subset of its input changes, we introduce a new phase in the MapReduce