run to the next. For instance, researchers have reported that the ratio between old and
new data when processing consecutive web crawls may range from 10× to 1000× [28].
Motivated by this observation, there have been several proposals for large-scale
incremental data-processing systems, such as Percolator [33] or CBP [28], to name a
few early and prominent examples. In these systems, the programmer is able to devise
an incremental update handler, which can store state across successive runs, and con-
tains the logic to update the output as the program is notified about input changes.
While this approach allows for significant improvements when compared with the
“single-shot” approach, that is, re-processing all the data whenever part of the
input changes or inputs are added or deleted, it also has the downside of requir-
ing programmers to adopt a new programming model and API. This has two nega-
tive implications. First, there is the programming effort to port a large set of existing
applications to the new programming model. Second, it is often difficult to devise the
logic for incrementally updating the output as the input changes: research in the area
of dynamic algorithms (i.e., algorithms to solve problems that are formulated in terms
of deltas to their input) shows that such algorithms can be very complex, even in cases
where the normal, non-incremental algorithm was easy to devise [16,20].
In this chapter, we present an overview of our prior work on designing and build-
ing a system called Incoop for large-scale incremental computations [11]. Incoop
extends the Hadoop open-source implementation of the MapReduce paradigm to
run unmodified MapReduce programs in an incremental way. The design and imple-
mentation of Incoop is inspired by recent advances in self-adjusting computation
[3,5,6,15,24], which offers a solution to the problem of automatic incrementaliza-
tion of programs, and draws on techniques developed in that line of work. The idea
behind Incoop is to enable the programmer to automatically incrementalize existing
MapReduce programs without making any modifications to the code. To
this end, Incoop records information about previously executed MapReduce tasks so
that this information can be reused in future MapReduce computations when possible.
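To make this reuse concrete, the sketch below shows task-level memoization keyed by a fingerprint of each task's input. It is a minimal illustration under assumed names (TaskMemoizer, runOrReuse) and does not reproduce Incoop's actual implementation, which persists memoized results across Hadoop jobs rather than in memory.

import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative task-level memoization: outputs of previously executed tasks
// are stored under a fingerprint of their input, and an incremental run
// reuses the stored output whenever the fingerprint is unchanged.
public class TaskMemoizer {
    // fingerprint of a task's input -> previously computed output
    private final Map<String, String> memo = new ConcurrentHashMap<>();

    // Hash the raw bytes that a task would consume (e.g., one input chunk).
    static String fingerprint(byte[] taskInput) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(taskInput);
        return HexFormat.of().formatHex(digest);
    }

    // Execute the task only if no memoized output exists for identical input.
    public String runOrReuse(byte[] taskInput, Function<byte[], String> task)
            throws Exception {
        return memo.computeIfAbsent(fingerprint(taskInput), k -> task.apply(taskInput));
    }
}

The control flow mirrors what the rest of this section describes: compute a fingerprint of a task's input, and skip re-executing the task when that fingerprint matches one seen in a previous run.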
The basic approach taken by Incoop consists of (1) splitting the computation into
subcomputations, where the natural candidate for a subcomputation is a MapReduce
task; (2) memoizing the inputs and outputs of each subcomputation; and (3) in an
incremental run, checking the inputs to a subcomputation and using the memoized
output without rerunning the task when the input remains unchanged. Despite being
a good starting point, this basic approach has several shortcomings that motivated us
to introduce a number of technical innovations in Incoop, namely:
Incremental HDFS. We introduce a file system called Inc-HDFS that
provides a scalable way of identifying the deltas in the inputs of two con-
secutive job runs. This reuses an idea from the LBFS file system [31],
which is to avoid splitting the input into fixed-size chunks and instead split
it based on its contents, such that small changes to the input keep most
chunk boundaries stable (a simplified sketch of this content-based chunking
follows this list item). The new file system is able to achieve a large reuse of
input chunks while maintaining compatibility with HDFS, which is the
most common interface to provide the input to a job in Hadoop.
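The following sketch illustrates the kind of content-based chunking Inc-HDFS relies on; the gear-style rolling hash, the fixed mask, and the class name ContentChunker are assumptions made for the example, not the exact fingerprinting scheme used by Inc-HDFS.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Content-defined chunking: a boundary is declared whenever a rolling hash
// over the most recent bytes matches a bit pattern, so an insertion or
// deletion only moves the boundaries close to the edit.
public class ContentChunker {
    static final long[] GEAR = new long[256];    // per-byte random values
    static {
        Random rnd = new Random(42);             // fixed seed for determinism
        for (int i = 0; i < 256; i++) GEAR[i] = rnd.nextLong();
    }
    static final long MASK = (1 << 13) - 1;      // ~8 KiB expected chunk size

    // Returns the offsets (exclusive) at which chunks end.
    static List<Integer> chunkBoundaries(byte[] data) {
        List<Integer> ends = new ArrayList<>();
        long h = 0;
        for (int i = 0; i < data.length; i++) {
            h = (h << 1) + GEAR[data[i] & 0xFF]; // gear-style rolling hash
            if ((h & MASK) == MASK) {            // boundary condition hit
                ends.add(i + 1);
                h = 0;                           // start hashing the next chunk
            }
        }
        if (ends.isEmpty() || ends.get(ends.size() - 1) != data.length)
            ends.add(data.length);               // trailing partial chunk
        return ends;
    }
}

Because the masked hash bits depend only on the most recently read bytes, chunks before and after a localized edit keep their old boundaries and therefore their old fingerprints, which is what allows the memoized results of the corresponding tasks to be reused.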
Contraction phase. To avoid rerunning a large reduce task when only a
small subset of its input changes, we introduce a new phase in the MapReduce