FIGURE 4.1 Basic design of Incoop. (Chunks Chunk-1 to Chunk-3 of the Incremental HDFS feed tasks Map-1 to Map-3; the memoization server is queried for the location of subcomputations; results flow to Reduce-1 and Reduce-2.)
normal MapReduce execution. To record and update the dependency graph implicitly,
our design includes a memoization server that stores a mapping from the input of a
previously run task to the location of the corresponding memoized output. When a task
completes, its output is memoized persistently, and a mapping from the input to the
location of the output is stored in the memoization server. Then, during an incremental
run, when a task is instantiated, the memoization server is queried to check if the inputs
to the task match those of a previous run. If so, the system reuses the outputs from the
previous run. Otherwise, the task runs normally and the mapping from its input to the
location of the newly produced output is stored in the memoization server.
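The lookup-or-run protocol above can be sketched as follows. This is a minimal in-memory model, not Incoop's actual implementation: the `MemoizationServer` class, its `store`/`lookup` methods, and the `run_task` helper are hypothetical names, and a digest of the task input stands in for whatever key the real system uses.

```python
import hashlib

class MemoizationServer:
    """In-memory sketch: maps a digest of a task's input to the
    location of the corresponding memoized output."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def _digest(task_input: bytes) -> str:
        return hashlib.sha1(task_input).hexdigest()

    def store(self, task_input: bytes, output_location: str) -> None:
        self._table[self._digest(task_input)] = output_location

    def lookup(self, task_input: bytes):
        # Returns the memoized output location, or None on a miss.
        return self._table.get(self._digest(task_input))

def run_task(server, task_input, compute, persist):
    """Reuse a memoized result if the input matches a previous run;
    otherwise run the task normally, persist its output, and record
    the input-to-location mapping for future incremental runs."""
    location = server.lookup(task_input)
    if location is not None:
        return location                      # reuse previous output
    location = persist(compute(task_input))  # normal execution
    server.store(task_input, location)
    return location
```

Running the same task twice on identical input invokes `compute` and `persist` only once; the second instantiation is answered entirely from the memoization server.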
This basic design raises a series of challenges, which we describe next. In subsequent sections, we present the key technical contributions we propose to address them.
4.2.3 Challenge: Transparency
Self-adjusting computation requires knowing the modifications to the input to update
the output. To this end, it requires a new interface for making changes to the input, so
that the edits, which are clearly identified by the interface, can be used to trigger an
incremental update. We wish to achieve the efficiency benefits of self-adjusting com-
putation transparently without requiring the programmer to change the way they run
MapReduce computations. This goal seems to conflict with the fact that HDFS (the
system employed to store inputs to MapReduce computations in Hadoop) is an append-
only file system, making it impossible to convey input deltas. To overcome this chal-
lenge, we store the inputs and outputs of consecutive runs in separate HDFS files and
compute a delta between two HDFS files in a way that is scalable and performs well.
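The idea of comparing two runs' files can be illustrated by digesting each file chunk by chunk and reporting which chunks of the new file have no match in the old one. This is only a sketch under simplifying assumptions: the function names are hypothetical, fixed-size chunks stand in for the content-based chunking a real system would need to keep the delta stable under insertions, and real HDFS files would be streamed rather than held in memory.

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and digest each one.
    (Fixed-size chunking is a simplification; content-based
    chunking keeps deltas small under shifted input.)"""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def delta(old: bytes, new: bytes, chunk_size: int = 4):
    """Return the indices of chunks in `new` with no matching chunk
    anywhere in `old` -- the only chunks an incremental run must
    reprocess; all others can reuse memoized results."""
    old_set = set(chunk_digests(old, chunk_size))
    return [i for i, d in enumerate(chunk_digests(new, chunk_size))
            if d not in old_set]
```

Because the delta is computed from the stored files of consecutive runs, the programmer keeps writing to HDFS normally; no change-notification interface is exposed.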
4.2.4 Challenge: Efficiency
To achieve efficient incremental updates, we must ensure that MapReduce computa-
tions remain stable under small changes to their input, meaning that, when executed