FIGURE 4.1 Basic design of Incoop. (Chunks Chunk-1 to Chunk-3 of the Incremental HDFS feed tasks Map-1 to Map-3; the memoization server is queried for the location of subcomputations; results flow to Reduce-1 and Reduce-2.)
normal MapReduce execution. To record and update the dependency graph implicitly,
our design includes a memoization server that stores a mapping from the input of a
previously run task to the location of the corresponding memoized output. When a task
completes, its output is memoized persistently, and a mapping from the input to the
location of the output is stored in the memoization server. Then, during an incremental
run, when a task is instantiated, the memoization server is queried to check if the inputs
to the task match those of a previous run. If so, the system reuses the outputs from the
previous run. Otherwise, the task runs normally and the mapping from its input to the
location of the newly produced output is stored in the memoization server.
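The lookup-or-run protocol above can be sketched as follows. This is a minimal in-memory model, not Incoop's actual implementation: the `MemoizationServer` class, its `store`/`lookup` methods, and the `run_task` helper are hypothetical names, and a digest of the task input stands in for whatever key the real system uses.

```python
import hashlib

class MemoizationServer:
    """In-memory sketch: maps a digest of a task's input to the
    location of the corresponding memoized output."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def _digest(task_input: bytes) -> str:
        return hashlib.sha1(task_input).hexdigest()

    def store(self, task_input: bytes, output_location: str) -> None:
        self._table[self._digest(task_input)] = output_location

    def lookup(self, task_input: bytes):
        # Returns the memoized output location, or None on a miss.
        return self._table.get(self._digest(task_input))

def run_task(server, task_input, compute, persist):
    """Reuse a memoized result if the input matches a previous run;
    otherwise run the task normally, persist its output, and record
    the input-to-location mapping for future incremental runs."""
    location = server.lookup(task_input)
    if location is not None:
        return location                      # reuse previous output
    location = persist(compute(task_input))  # normal execution
    server.store(task_input, location)
    return location
```

Running the same task twice on identical input invokes `compute` and `persist` only once; the second instantiation is answered entirely from the memoization server.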
This basic design raises a series of challenges, which we describe next. In subsequent sections, we present the key technical contributions we propose to address them.
4.2.3 Challenge: Transparency
Self-adjusting computation requires knowing the modifications to the input to update
the output. To this end, it requires a new interface for making changes to the input, so
that the edits, which are clearly identified by the interface, can be used to trigger an
incremental update. We wish to achieve the efficiency benefits of self-adjusting com-
putation transparently without requiring the programmer to change the way they run
MapReduce computations. This goal seems to conflict with the fact that HDFS (the
system employed to store inputs to MapReduce computations in Hadoop) is an append-
only file system, making it impossible to convey input deltas. To overcome this chal-
lenge, we store the inputs and outputs of consecutive runs in separate HDFS files and
compute a delta between two HDFS files in a way that is scalable and performs well.
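The idea of comparing two runs' files can be illustrated by digesting each file chunk by chunk and reporting which chunks of the new file have no match in the old one. This is only a sketch under simplifying assumptions: the function names are hypothetical, fixed-size chunks stand in for the content-based chunking a real system would need to keep the delta stable under insertions, and real HDFS files would be streamed rather than held in memory.

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and digest each one.
    (Fixed-size chunking is a simplification; content-based
    chunking keeps deltas small under shifted input.)"""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def delta(old: bytes, new: bytes, chunk_size: int = 4):
    """Return the indices of chunks in `new` with no matching chunk
    anywhere in `old` -- the only chunks an incremental run must
    reprocess; all others can reuse memoized results."""
    old_set = set(chunk_digests(old, chunk_size))
    return [i for i, d in enumerate(chunk_digests(new, chunk_size))
            if d not in old_set]
```

Because the delta is computed from the stored files of consecutive runs, the programmer keeps writing to HDFS normally; no change-notification interface is exposed.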
4.2.4 Challenge: Efficiency
To achieve efficient incremental updates, we must ensure that MapReduce computa-
tions remain stable under small changes to their input, meaning that, when executed