merge together spill files generated by the same uncommitted mapper, but will
not combine those spill files with the output of other map tasks until it has been
notified that the map task has committed. Thus, if a map task fails, each reduce
task can ignore any tentative spill files produced by the failed map attempt. The
JobTracker will take care of scheduling a new map task attempt, as in standard
Hadoop. In principle, the main limitation of the MapReduce Online approach is that it is based on HDFS; it is therefore not suitable for streaming applications, in which data streams must be processed without any disk involvement.
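As an illustration only (none of the following names come from the MapReduce Online code base), the commit-aware handling of spill files described above could be sketched as reducer-side bookkeeping: spills from an uncommitted map attempt are kept apart, folded into the merged input only once that attempt commits, and simply discarded if the attempt fails.

```python
from collections import defaultdict

class SpillTracker:
    """Reducer-side bookkeeping for tentative map output (illustrative sketch)."""

    def __init__(self):
        self.tentative = defaultdict(list)   # (map_id, attempt) -> spill records
        self.committed = []                  # records from committed attempts

    def add_spill(self, map_id, attempt, records):
        # Spills of the same uncommitted attempt may be merged with each other,
        # but are never combined with other map tasks' output at this point.
        self.tentative[(map_id, attempt)].extend(records)

    def on_commit(self, map_id, attempt):
        # Only after the commit notification does this attempt's output
        # become part of the reduce task's merged input.
        self.committed.extend(self.tentative.pop((map_id, attempt), []))

    def on_failure(self, map_id, attempt):
        # Tentative spills of a failed attempt are ignored; the JobTracker
        # schedules a new attempt, as in standard Hadoop.
        self.tentative.pop((map_id, attempt), None)

    def merged_input(self):
        # The reduce function only ever sees committed output, sorted by key.
        return sorted(self.committed)


tracker = SpillTracker()
tracker.add_spill("map-1", 0, [("a", 1), ("b", 1)])
tracker.on_failure("map-1", 0)                 # attempt 0 failed: output dropped
tracker.add_spill("map-1", 1, [("a", 1), ("b", 2)])
tracker.on_commit("map-1", 1)
print(tracker.merged_input())                  # [('a', 1), ('b', 2)]
```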
A similar approach has been presented by Logothetis and Yocum [179], which defines an incremental MapReduce job as one that processes data in large batches of tuples and runs continuously according to a specified window range and slide increment. In particular, each slide produces a MapReduce result that covers all data within the window (defined by time or data size); the model also supports landmark MapReduce jobs, in which the trailing edge of the window is fixed and the system incorporates new data into the existing result. Map functions are trivially continuous and process data on a tuple-by-tuple basis. However, before the reduce function can process the mapped data,
the data must be partitioned across the reduce operators and sorted. When the map
operator first receives a new key-value pair, it calls the map function and inserts the
result into the latest increment in the map results. The operator then assigns output
key-value pairs to reduce tasks, grouping them according to the partition function.
Continuous reduce operators participate in the sort as well, grouping values by their
keys before calling the reduce function.
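A minimal sketch of this window-based incremental model is given below. The word-count map and reduce functions, the operator class, and all parameter names are illustrative assumptions rather than the interfaces of [179]: records are mapped one at a time as they arrive, the map output is grouped by key across the reduce tasks, and every slide triggers a reduce pass over the data currently inside the window.

```python
from collections import defaultdict, deque

def map_fn(record):
    # Illustrative word-count map: emit (word, 1) for each word in the record.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

class ContinuousJob:
    """Sliding-window incremental MapReduce over a record stream (sketch only)."""

    def __init__(self, window=4, slide=2, n_reducers=2):
        self.window, self.slide, self.n_reducers = window, slide, n_reducers
        self.increments = deque()    # map output per input record, newest last
        self.since_slide = 0

    def partition(self, key):
        # Assign map output to reduce tasks by hashing the key.
        return hash(key) % self.n_reducers

    def process(self, record):
        # Map side is continuous: each record is mapped as soon as it arrives.
        self.increments.append(list(map_fn(record)))
        if len(self.increments) > self.window:   # window measured in records here
            self.increments.popleft()
        self.since_slide += 1
        if self.since_slide == self.slide:       # every `slide` records, reduce fires
            self.since_slide = 0
            return self._reduce_window()
        return None

    def _reduce_window(self):
        # Shuffle: group the window's map output by reduce task, then by key.
        groups = [defaultdict(list) for _ in range(self.n_reducers)]
        for increment in self.increments:
            for key, value in increment:
                groups[self.partition(key)][key].append(value)
        # Reduce: each task sorts its keys and applies the reduce function.
        return [reduce_fn(k, vs) for g in groups for k, vs in sorted(g.items())]


# Every `slide` records, a fresh result over the last `window` records is emitted.
job = ContinuousJob(window=4, slide=2)
for line in ["a b", "b c", "c c", "a a", "b b"]:
    result = job.process(line)
    if result is not None:
        print(result)
```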
The Incoop system [81] has been introduced as a MapReduce implementation adapted for incremental computations: it detects changes to the input datasets and automatically updates the outputs of MapReduce jobs by employing a fine-grained result reuse mechanism. In particular, it allows MapReduce programs that were not designed for incremental processing to be executed transparently in an incremental manner. To achieve this goal, the
design of Incoop introduces new techniques that are incorporated into the Hadoop
MapReduce framework. For example, instead of relying on HDFS to store the input
to MapReduce jobs, Incoop devises a file system called Inc-HDFS (Incremental
HDFS) that provides mechanisms to identify similarities in the input data of
consecutive job runs. In particular, Inc-HDFS splits the input into chunks whose boundaries depend on the file contents, so that small changes to the input do not shift all chunk boundaries. This content-defined partitioning can therefore maximize the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by offering the same interface and semantics.
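A minimal sketch of such content-defined chunking is shown below; the toy hash, the boundary mask, and all parameter values are assumptions for illustration rather than Inc-HDFS's actual algorithm. Because a boundary is declared wherever a hash of the last few bytes matches a fixed pattern, an insertion only disturbs the chunks around the edit, and all other chunks stay byte-identical.

```python
import random

def content_defined_chunks(data, window=8, mask=0x1F):
    """Split data into chunks whose boundaries depend only on the content.

    A boundary is placed after position i whenever a hash of the `window`
    bytes ending at i matches `mask` (roughly one boundary every 32 bytes).
    The parameters and the toy hash are illustrative, not Inc-HDFS's; a real
    implementation would use an incrementally updated (rolling) fingerprint.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        h = 0
        for b in data[i - window + 1:i + 1]:    # hash the last `window` bytes
            h = (h * 131 + b) & 0xFFFFFFFF
        if (h & mask) == mask:                  # the content decides the boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


random.seed(0)
original = bytes(random.randrange(256) for _ in range(4096))
edited = original[:1000] + b"X" + original[1000:]   # a one-byte insertion

old = set(content_defined_chunks(original))
new = content_defined_chunks(edited)
# Because boundaries are content-defined, only the chunks around the edit
# change; all others are byte-identical and their results could be reused.
print(sum(1 for c in new if c in old), "of", len(new), "chunks unchanged")
```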
In addition, Incoop controls the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. To this end, it introduces a new Contraction phase that leverages Combiner functions to reduce network traffic by anticipating a small part of the processing done by the Reduce tasks and to control their granularity.
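A rough sketch of this contraction idea, using word count's summing Combiner and hypothetical function names (not Incoop's API), is shown below: the input of one large Reduce task is cut into sub-groups, each sub-group is collapsed by the Combiner and could be memoized independently, and the Reduce task then merges only the much smaller combined results.

```python
from collections import defaultdict

def combine(pairs):
    # Word-count Combiner: partially aggregate (word, count) pairs.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return tuple(sorted(totals.items()))

def contraction_phase(map_output, subtask_size=4):
    # Cut one large reduce input into smaller sub-tasks; each sub-task's
    # combined result could be memoized and reused while its input is unchanged.
    subtasks = [map_output[i:i + subtask_size]
                for i in range(0, len(map_output), subtask_size)]
    return [combine(subtask) for subtask in subtasks]

def reduce_task(combined_results):
    # The Reduce task now merges a few small combined results instead of
    # the full map output for its partition.
    return dict(combine(pair for result in combined_results for pair in result))

map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1),
              ("b", 1), ("b", 1), ("a", 1), ("c", 1)]
print(reduce_task(contraction_phase(map_output)))   # {'a': 3, 'b': 3, 'c': 2}
```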
Furthermore, Incoop improves the effectiveness of memoization by implementing an affinity-based scheduler that applies a work-stealing algorithm to minimize the amount of data movement across machines. This modified scheduler strikes a balance between exploiting the locality of previously