from the previous job and the map function of the next job cannot be overlapped as the final result of the reduce step cannot be produced until all map tasks have completed, which prevents effective pipelining. Therefore, the reducer treats the output of a pipelined map task as tentative until the JobTracker informs the reducer that the map task has committed successfully. The reducer can merge together spill files generated by the same uncommitted mapper, but will not combine those spill files with the output of other map tasks until it has been notified that the map task has committed. Thus, if a map task fails, each reduce task can ignore any tentative spill files produced by the failed map attempt. The JobTracker will take care of scheduling a new map task attempt, as in standard Hadoop. In principle, the main limitation of the MapReduce Online approach is that it is based on HDFS. Therefore, it is not suitable for streaming applications, in which data streams have to be processed without any disk involvement.
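To make the commit protocol concrete, the following minimal Python sketch shows how a pipelining reducer could track tentative spill files per map attempt, merging them into its committed set only once the JobTracker signals a successful commit. All names (`SpillTracker`, `on_spill`, and so on) are illustrative and not part of the actual MapReduce Online code base.

```python
from collections import defaultdict

class SpillTracker:
    def __init__(self):
        self.tentative = defaultdict(list)  # map attempt id -> tentative spill files
        self.committed = []                 # spills safe to merge across map tasks

    def on_spill(self, attempt_id, spill_file):
        # Spills from an uncommitted map attempt stay isolated; they may
        # only be merged with other spills from the *same* attempt.
        self.tentative[attempt_id].append(spill_file)

    def on_map_commit(self, attempt_id):
        # The JobTracker reported a successful commit: the attempt's spills
        # become final and can be combined with other map tasks' output.
        self.committed.extend(self.tentative.pop(attempt_id, []))

    def on_map_failure(self, attempt_id):
        # A failed attempt's tentative spills are simply discarded;
        # the JobTracker schedules a new attempt as in standard Hadoop.
        self.tentative.pop(attempt_id, None)
```

Because tentative spills are keyed by attempt, discarding the output of a failed attempt is a constant-time dictionary removal rather than a scan over already merged data.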
A similar approach has been presented by Logothetis and Yocum [94], which defines an incremental MapReduce job as one that processes data in large batches of tuples and runs continuously according to a specific window range and slide increment. In particular, it produces a MapReduce result that includes all data within a window (of time or data size) on every slide, and considers landmark MapReduce jobs where the trailing edge of the window is fixed and the system incorporates new data into the existing result. Map functions are trivially continuous and process data on a tuple-by-tuple basis. However, before the reduce function can process the mapped data, the data must be partitioned across the reduce operators and sorted. When the map operator first receives a new key-value pair, it calls the map function and inserts the result into the latest increment in the map results. The operator then assigns output key-value pairs to reduce tasks, grouping them according to the partition function. Continuous reduce operators participate in the sort as well, grouping values by their keys before calling the reduce function.
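The following Python sketch illustrates this model under simplified assumptions: a continuous map that processes tuples one at a time and routes its output through a partition function, and a continuous reduce that groups a partition by key before applying the reduce function once per window slide. The function names and the hash-based partitioner are hypothetical, not taken from [94].

```python
from collections import defaultdict

def continuous_map(window, map_fn, num_reducers):
    """Process one window's tuples one at a time; route output by partition."""
    partitions = defaultdict(list)
    for record in window:
        for key, value in map_fn(record):
            partitions[hash(key) % num_reducers].append((key, value))
    return partitions

def continuous_reduce(partition, reduce_fn):
    """Group a partition's pairs by key (the 'sort'), then reduce each group."""
    groups = defaultdict(list)
    for key, value in partition:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

In a landmark job, the `groups` structure would persist across slides, so that each new increment is folded into the existing result instead of recomputing the whole window from scratch.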
The Incoop system [19] has been introduced as a MapReduce implementation that has been adapted for incremental computations, which detects changes on the input data sets and enables the automatic update of the outputs of the MapReduce jobs by employing a fine-grained result reuse mechanism. In particular, it allows MapReduce programs that are not designed for incremental processing to be executed transparently in an incremental manner. To achieve this goal, the design of Incoop introduces new techniques that are incorporated into the Hadoop MapReduce framework. For example, instead of relying on HDFS to store the input to MapReduce jobs, Incoop devises a file system called Inc-HDFS (Incremental HDFS) that provides mechanisms to identify similarities in the input data of consecutive job runs. In particular, Inc-HDFS splits the input into chunks whose boundaries depend on the file contents, so that small changes to the input do not change all chunk boundaries. Therefore, this partitioning mechanism maximizes the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by offering the same interface and semantics.
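A common way to realize such content-dependent boundaries is a rolling hash over a small byte window, declaring a boundary wherever the hash matches a fixed bit pattern; a local edit then perturbs only nearby boundaries. The sketch below illustrates the idea in Python with a Rabin-Karp-style rolling hash; the window size, modulus, and mask are arbitrary illustrative choices, not the parameters used by Inc-HDFS.

```python
B, M, W = 257, 1_000_000_007, 48   # hash base, modulus, window size (illustrative)
MASK = (1 << 12) - 1               # boundary when low 12 bits are set: ~4 KiB chunks
POW = pow(B, W, M)                 # B**W mod M, for removing the oldest byte

def chunk_boundaries(data: bytes):
    """Return end offsets of content-defined chunks of `data`."""
    boundaries, h = [], 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % M                 # push the new byte into the window
        if i >= W:
            h = (h - data[i - W] * POW) % M    # drop the byte leaving the window
        if i + 1 >= W and (h & MASK) == MASK:  # the content itself picks the cut
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))           # close the final partial chunk
    return boundaries
```

Since each boundary depends only on the preceding `W` bytes, inserting or deleting a few bytes re-chunks only the affected neighborhood, which is exactly the property that lets results computed for unchanged chunks be reused across job runs.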
In addition, Incoop controls the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. Therefore, it introduces a new Contraction phase that leverages Combiner functions to reduce the network traffic by anticipating a small part of the processing done by the Reducer tasks and to control their granularity.
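The intent of such a contraction step can be sketched as applying an associative Combiner to fixed-size sub-batches of a reduce task's input, so that each sub-result is small enough to memoize and reuse when only some sub-batches change. The sketch below is a hedged illustration of that idea, not Incoop's actual implementation; `contract` and `batch_size` are hypothetical names.

```python
from itertools import islice

def contract(pairs, combine_fn, batch_size=1024):
    """Pre-aggregate (key, value) pairs in sub-batches before the Reducer.
    combine_fn must be associative (e.g., addition) for this to be safe."""
    it = iter(pairs)
    contracted = []
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        partial = {}
        for key, value in batch:               # combine within the sub-batch
            partial[key] = combine_fn(partial[key], value) if key in partial else value
        contracted.extend(partial.items())     # one memoizable sub-result
    return contracted

# Example: contract(word_pairs, lambda a, b: a + b) ships per-batch partial
# counts instead of every raw pair, cutting network traffic to the Reducer.
```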
their granularity. Furthermore, Incoop improves the effectiveness of memoization