from the previous job and the map function of the next job cannot be overlapped as the final result of the reduce step cannot be produced until all map tasks have completed, which prevents effective pipelining. Therefore, the reducer treats the output of a pipelined map task as tentative until the JobTracker informs the reducer that the map task has committed successfully. The reducer can merge together spill files generated by the same uncommitted mapper, but will not combine those spill files with the output of other map tasks until it has been notified that the map task has committed. Thus, if a map task fails, each reduce task can ignore any tentative spill files produced by the failed map attempt. The JobTracker will take care of scheduling a new map task attempt, as in standard Hadoop. In principle, the main limitation of the MapReduce Online approach is that it is based on HDFS. Therefore, it is not suitable for streaming applications, in which data streams have to be processed without any disk involvement.
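To make the commit protocol concrete, the following minimal Python sketch shows how a pipelining reducer could track tentative spill files per map attempt, merging them into its committed set only once the JobTracker signals a successful commit. All names (`SpillTracker`, `on_spill`, and so on) are illustrative and not part of the actual MapReduce Online code base.

```python
from collections import defaultdict

class SpillTracker:
    def __init__(self):
        self.tentative = defaultdict(list)  # map attempt id -> tentative spill files
        self.committed = []                 # spills safe to merge across map tasks

    def on_spill(self, attempt_id, spill_file):
        # Spills from an uncommitted map attempt stay isolated; they may
        # only be merged with other spills from the *same* attempt.
        self.tentative[attempt_id].append(spill_file)

    def on_map_commit(self, attempt_id):
        # The JobTracker reported a successful commit: the attempt's spills
        # become final and can be combined with other map tasks' output.
        self.committed.extend(self.tentative.pop(attempt_id, []))

    def on_map_failure(self, attempt_id):
        # A failed attempt's tentative spills are simply discarded;
        # the JobTracker schedules a new attempt as in standard Hadoop.
        self.tentative.pop(attempt_id, None)
```

Because tentative spills are keyed by attempt, discarding the output of a failed attempt is a constant-time dictionary removal rather than a scan over already merged data.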
A similar approach has been presented by Logothetis and Yocum [94], which defines an incremental MapReduce job as one that processes data in large batches of tuples and runs continuously according to a specific window range and slide increment. In particular, it produces a MapReduce result that includes all data within a window (of time or data size) on every slide, and considers landmark MapReduce jobs where the trailing edge of the window is fixed and the system incorporates new data into the existing result. Map functions are trivially continuous and process data on a tuple-by-tuple basis. However, before the reduce function can process the mapped data, the data must be partitioned across the reduce operators and sorted. When the map operator first receives a new key-value pair, it calls the map function and inserts the result into the latest increment in the map results. The operator then assigns output key-value pairs to reduce tasks, grouping them according to the partition function. Continuous reduce operators participate in the sort as well, grouping values by their keys before calling the reduce function.
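The following Python sketch illustrates this model under simplified assumptions: a continuous map that processes tuples one at a time and routes its output through a partition function, and a continuous reduce that groups a partition by key before applying the reduce function once per window slide. The function names and the hash-based partitioner are hypothetical, not taken from [94].

```python
from collections import defaultdict

def continuous_map(window, map_fn, num_reducers):
    """Process one window's tuples one at a time; route output by partition."""
    partitions = defaultdict(list)
    for record in window:
        for key, value in map_fn(record):
            partitions[hash(key) % num_reducers].append((key, value))
    return partitions

def continuous_reduce(partition, reduce_fn):
    """Group a partition's pairs by key (the 'sort'), then reduce each group."""
    groups = defaultdict(list)
    for key, value in partition:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

In a landmark job, the `groups` structure would persist across slides, so that each new increment is folded into the existing result instead of recomputing the whole window from scratch.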
The Incoop system [19] has been introduced as a MapReduce implementation that has been adapted for incremental computations, which detects changes on the input data sets and enables the automatic update of the outputs of the MapReduce jobs by employing a fine-grained result reuse mechanism. In particular, it allows MapReduce programs that are not designed for incremental processing to be executed transparently in an incremental manner. To achieve this goal, the design of Incoop introduces new techniques that are incorporated into the Hadoop MapReduce framework. For example, instead of relying on HDFS to store the input to MapReduce jobs, Incoop devises a file system called Inc-HDFS (Incremental HDFS) that provides mechanisms to identify similarities in the input data of consecutive job runs. In particular, Inc-HDFS splits the input into chunks whose boundaries depend on the file contents, so that small changes to the input do not change all chunk boundaries. Therefore, this partitioning mechanism maximizes the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by offering the same interface and semantics.
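A common way to realize such content-dependent boundaries is a rolling hash over a small byte window, declaring a boundary wherever the hash matches a fixed bit pattern; a local edit then perturbs only nearby boundaries. The sketch below illustrates the idea in Python with a Rabin-Karp-style rolling hash; the window size, modulus, and mask are arbitrary illustrative choices, not the parameters used by Inc-HDFS.

```python
B, M, W = 257, 1_000_000_007, 48   # hash base, modulus, window size (illustrative)
MASK = (1 << 12) - 1               # boundary when low 12 bits are set: ~4 KiB chunks
POW = pow(B, W, M)                 # B**W mod M, for removing the oldest byte

def chunk_boundaries(data: bytes):
    """Return end offsets of content-defined chunks of `data`."""
    boundaries, h = [], 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % M                 # push the new byte into the window
        if i >= W:
            h = (h - data[i - W] * POW) % M    # drop the byte leaving the window
        if i + 1 >= W and (h & MASK) == MASK:  # the content itself picks the cut
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))           # close the final partial chunk
    return boundaries
```

Since each boundary depends only on the preceding `W` bytes, inserting or deleting a few bytes re-chunks only the affected neighborhood, which is exactly the property that lets results computed for unchanged chunks be reused across job runs.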
In addition, Incoop controls the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. Therefore, it introduces a new Contraction phase that leverages Combiner functions to reduce the network traffic by anticipating a small part of the processing done by the Reducer tasks and to control their granularity.
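The intent of such a contraction step can be sketched as applying an associative Combiner to fixed-size sub-batches of a reduce task's input, so that each sub-result is small enough to memoize and reuse when only some sub-batches change. The sketch below is a hedged illustration of that idea, not Incoop's actual implementation; `contract` and `batch_size` are hypothetical names.

```python
from itertools import islice

def contract(pairs, combine_fn, batch_size=1024):
    """Pre-aggregate (key, value) pairs in sub-batches before the Reducer.
    combine_fn must be associative (e.g., addition) for this to be safe."""
    it = iter(pairs)
    contracted = []
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        partial = {}
        for key, value in batch:               # combine within the sub-batch
            partial[key] = combine_fn(partial[key], value) if key in partial else value
        contracted.extend(partial.items())     # one memoizable sub-result
    return contracted

# Example: contract(word_pairs, lambda a, b: a + b) ships per-batch partial
# counts instead of every raw pair, cutting network traffic to the Reducer.
```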
their granularity. Furthermore, Incoop improves the effectiveness of memoization