merge together spill files generated by the same uncommitted mapper, but will
not combine those spill files with the output of other map tasks until it has been
notified that the map task has committed. Thus, if a map task fails, each reduce
task can ignore any tentative spill files produced by the failed map attempt. The
JobTracker will take care of scheduling a new map task attempt, as in standard
Hadoop. In principle, the main limitation of the MapReduce Online approach is that it is based on HDFS; it is therefore not suitable for streaming applications, in which data streams must be processed without any disk involvement.
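As an illustration only (none of the following names come from the MapReduce Online code base), the commit-aware handling of spill files described above could be sketched as reducer-side bookkeeping: spills from an uncommitted map attempt are kept apart, folded into the merged input only once that attempt commits, and simply discarded if the attempt fails.

```python
from collections import defaultdict

class SpillTracker:
    """Reducer-side bookkeeping for tentative map output (illustrative sketch)."""

    def __init__(self):
        self.tentative = defaultdict(list)   # (map_id, attempt) -> spill records
        self.committed = []                  # records from committed attempts

    def add_spill(self, map_id, attempt, records):
        # Spills of the same uncommitted attempt may be merged with each other,
        # but are never combined with other map tasks' output at this point.
        self.tentative[(map_id, attempt)].extend(records)

    def on_commit(self, map_id, attempt):
        # Only after the commit notification does this attempt's output
        # become part of the reduce task's merged input.
        self.committed.extend(self.tentative.pop((map_id, attempt), []))

    def on_failure(self, map_id, attempt):
        # Tentative spills of a failed attempt are ignored; the JobTracker
        # schedules a new attempt, as in standard Hadoop.
        self.tentative.pop((map_id, attempt), None)

    def merged_input(self):
        # The reduce function only ever sees committed output, sorted by key.
        return sorted(self.committed)


tracker = SpillTracker()
tracker.add_spill("map-1", 0, [("a", 1), ("b", 1)])
tracker.on_failure("map-1", 0)                 # attempt 0 failed: output dropped
tracker.add_spill("map-1", 1, [("a", 1), ("b", 2)])
tracker.on_commit("map-1", 1)
print(tracker.merged_input())                  # [('a', 1), ('b', 2)]
```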
A similar approach has been presented by Logothetis and Yocum [179], which defines an incremental MapReduce job as one that processes data in large batches of tuples and runs continuously according to a specified window range and slide increment. In particular, each slide produces a MapReduce result that covers all data within the window (defined by time or data size); the model also supports landmark MapReduce jobs, in which the trailing edge of the window is fixed and the system incorporates new data into the existing result. Map functions are trivially continuous and process data on a tuple-by-tuple basis. However, before the reduce function can process the mapped data,
the data must be partitioned across the reduce operators and sorted. When the map
operator first receives a new key-value pair, it calls the map function and inserts the
result into the latest increment in the map results. The operator then assigns output
key-value pairs to reduce tasks, grouping them according to the partition function.
Continuous reduce operators participate in the sort as well, grouping values by their
keys before calling the reduce function.
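A minimal sketch of this window-based incremental model is given below. The word-count map and reduce functions, the operator class, and all parameter names are illustrative assumptions rather than the interfaces of [179]: records are mapped one at a time as they arrive, the map output is grouped by key across the reduce tasks, and every slide triggers a reduce pass over the data currently inside the window.

```python
from collections import defaultdict, deque

def map_fn(record):
    # Illustrative word-count map: emit (word, 1) for each word in the record.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

class ContinuousJob:
    """Sliding-window incremental MapReduce over a record stream (sketch only)."""

    def __init__(self, window=4, slide=2, n_reducers=2):
        self.window, self.slide, self.n_reducers = window, slide, n_reducers
        self.increments = deque()    # map output per input record, newest last
        self.since_slide = 0

    def partition(self, key):
        # Assign map output to reduce tasks by hashing the key.
        return hash(key) % self.n_reducers

    def process(self, record):
        # Map side is continuous: each record is mapped as soon as it arrives.
        self.increments.append(list(map_fn(record)))
        if len(self.increments) > self.window:   # window measured in records here
            self.increments.popleft()
        self.since_slide += 1
        if self.since_slide == self.slide:       # every `slide` records, reduce fires
            self.since_slide = 0
            return self._reduce_window()
        return None

    def _reduce_window(self):
        # Shuffle: group the window's map output by reduce task, then by key.
        groups = [defaultdict(list) for _ in range(self.n_reducers)]
        for increment in self.increments:
            for key, value in increment:
                groups[self.partition(key)][key].append(value)
        # Reduce: each task sorts its keys and applies the reduce function.
        return [reduce_fn(k, vs) for g in groups for k, vs in sorted(g.items())]


# Every `slide` records, a fresh result over the last `window` records is emitted.
job = ContinuousJob(window=4, slide=2)
for line in ["a b", "b c", "c c", "a a", "b b"]:
    result = job.process(line)
    if result is not None:
        print(result)
```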
The Incoop system [81] has been introduced as a MapReduce implementation adapted for incremental computations: it detects changes to the input datasets and automatically updates the outputs of MapReduce jobs by employing a fine-grained result reuse mechanism. In particular, it allows MapReduce programs that were not designed for incremental processing to be executed transparently in an incremental manner. To achieve this goal, the
design of Incoop introduces new techniques that are incorporated into the Hadoop
MapReduce framework. For example, instead of relying on HDFS to store the input
to MapReduce jobs, Incoop devises a file system called Inc-HDFS (Incremental
HDFS) that provides mechanisms to identify similarities in the input data of
consecutive job runs. In particular, Inc-HDFS splits the input into chunks whose boundaries depend on the file contents, so that small changes to the input do not shift all chunk boundaries. This content-defined partitioning can therefore maximize the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by offering the same interface and semantics.
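A minimal sketch of such content-defined chunking is shown below; the toy hash, the boundary mask, and all parameter values are assumptions for illustration rather than Inc-HDFS's actual algorithm. Because a boundary is declared wherever a hash of the last few bytes matches a fixed pattern, an insertion only disturbs the chunks around the edit, and all other chunks stay byte-identical.

```python
import random

def content_defined_chunks(data, window=8, mask=0x1F):
    """Split data into chunks whose boundaries depend only on the content.

    A boundary is placed after position i whenever a hash of the `window`
    bytes ending at i matches `mask` (roughly one boundary every 32 bytes).
    The parameters and the toy hash are illustrative, not Inc-HDFS's; a real
    implementation would use an incrementally updated (rolling) fingerprint.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        h = 0
        for b in data[i - window + 1:i + 1]:    # hash the last `window` bytes
            h = (h * 131 + b) & 0xFFFFFFFF
        if (h & mask) == mask:                  # the content decides the boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


random.seed(0)
original = bytes(random.randrange(256) for _ in range(4096))
edited = original[:1000] + b"X" + original[1000:]   # a one-byte insertion

old = set(content_defined_chunks(original))
new = content_defined_chunks(edited)
# Because boundaries are content-defined, only the chunks around the edit
# change; all others are byte-identical and their results could be reused.
print(sum(1 for c in new if c in old), "of", len(new), "chunks unchanged")
```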
In addition, Incoop controls the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. To this end, it introduces a new Contraction phase that leverages Combiner functions to reduce network traffic by anticipating a small part of the processing done by the Reduce tasks and to control their granularity.
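A rough sketch of this contraction idea, using word count's summing Combiner and hypothetical function names (not Incoop's API), is shown below: the input of one large Reduce task is cut into sub-groups, each sub-group is collapsed by the Combiner and could be memoized independently, and the Reduce task then merges only the much smaller combined results.

```python
from collections import defaultdict

def combine(pairs):
    # Word-count Combiner: partially aggregate (word, count) pairs.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return tuple(sorted(totals.items()))

def contraction_phase(map_output, subtask_size=4):
    # Cut one large reduce input into smaller sub-tasks; each sub-task's
    # combined result could be memoized and reused while its input is unchanged.
    subtasks = [map_output[i:i + subtask_size]
                for i in range(0, len(map_output), subtask_size)]
    return [combine(subtask) for subtask in subtasks]

def reduce_task(combined_results):
    # The Reduce task now merges a few small combined results instead of
    # the full map output for its partition.
    return dict(combine(pair for result in combined_results for pair in result))

map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1),
              ("b", 1), ("b", 1), ("a", 1), ("c", 1)]
print(reduce_task(contraction_phase(map_output)))   # {'a': 3, 'b': 3, 'c': 2}
```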
Furthermore, Incoop improves the effectiveness of memoization by implementing an affinity-based scheduler that applies a work-stealing algorithm to minimize the amount of data movement across machines. This modified scheduler strikes a balance between exploiting the locality of previously