To efficiently support iterative data analysis applications, HaLoop incorporates the following changes into the basic Hadoop MapReduce framework:
• It exposes a new application programming interface that simplifies the expression of iterative MapReduce programs.
• Its master node contains a new loop control module that repeatedly starts new map-reduce steps composing the loop body until a user-specified stopping condition is met.
• It uses a new task scheduler that leverages data locality.
• It caches and indexes application data on the slave nodes.
In principle, HaLoop relies on the same file system and has the same task queue structure as Hadoop, but the task scheduler and task tracker modules are modified, and the loop control, caching, and indexing modules are newly introduced into the architecture. The task tracker not only manages task execution but also manages the caches and indexes on the slave node, and redirects each task's cache and index accesses to the local file system.
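HaLoop's loop control module automates the iteration that a plain Hadoop application must otherwise drive by hand from a client-side loop. For contrast, a minimal hand-rolled driver in vanilla Hadoop might look like the following sketch; the identity Mapper/Reducer placeholders, the per-iteration output paths, and the "convergence"/"CHANGED" counter are illustrative assumptions for this example, not HaLoop's actual interface.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int maxIterations = 10;   // fixed upper bound on the number of iterations
        long threshold = 0;       // stop when no more records change
        Path input = new Path(args[0]);
        for (int i = 0; i < maxIterations; i++) {
            Path output = new Path(args[1] + "/iter-" + i);
            Job job = Job.getInstance(conf, "loop-body-" + i);
            job.setJarByClass(IterativeDriver.class);
            // The identity Mapper/Reducer stand in for the application's loop body.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) System.exit(1);
            // Assumed convention: the real reducer increments this counter for
            // every record whose value changed in the current iteration.
            long changed = job.getCounters()
                              .findCounter("convergence", "CHANGED").getValue();
            if (changed <= threshold) break;  // user-specified stopping condition met
            input = output;  // this step's output becomes the next step's input
        }
    }
}

Everything in this loop (the per-iteration job submission, the rereading of loop-invariant data from the distributed file system, and the convergence test) is what HaLoop's loop control, caching, and indexing modules move into the framework itself.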
In the MapReduce framework, each map or reduce task is given a portion of the input data, runs the map or reduce function over its input records, and ends its life cycle once all of those records have been processed. The iMapReduce framework [138] supports iterative processing by keeping every map and reduce task alive during the whole iterative process. In particular, when a persistent task has parsed and processed all of its input data, it becomes dormant, waiting for new updated input data. A map task waits for the results of the reduce tasks and is activated to work on the new input records when the required data from the reduce tasks arrive. The reduce tasks wait for the map tasks' output and are activated synchronously, as in MapReduce.
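The dormant-and-reactivate behavior of a persistent task can be pictured as a worker thread that blocks on a queue until the other side of the pipeline delivers new data. The following sketch is a plain-Java illustration of that life cycle, not iMapReduce's actual implementation; the class name, the feed method, and the empty-list termination signal are all assumptions made for the example.

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PersistentMapTask implements Runnable {
    // Queue through which the reduce side delivers new input for the next
    // iteration; an empty list is used here as a termination signal.
    private final BlockingQueue<List<String>> fromReduce = new LinkedBlockingQueue<>();

    public void feed(List<String> reduceOutput) throws InterruptedException {
        fromReduce.put(reduceOutput);
    }

    @Override
    public void run() {
        try {
            while (true) {
                List<String> records = fromReduce.take(); // dormant until data arrives
                if (records.isEmpty()) break;             // master signaled termination
                for (String record : records) {
                    map(record);                          // apply the map function
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void map(String record) {
        // Placeholder map function; a real task would emit key/value pairs
        // to the reduce tasks of the same iteration.
        System.out.println("map(" + record + ")");
    }

    public static void main(String[] args) throws InterruptedException {
        PersistentMapTask task = new PersistentMapTask();
        Thread worker = new Thread(task);
        worker.start();
        task.feed(List.of("record-1", "record-2")); // input for one iteration
        task.feed(List.of());                       // terminate the persistent task
        worker.join();
    }
}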
Jobs can terminate their iterative process in one of two ways:
1. Defining a fixed number of iterations: the iterative algorithm stops after it has iterated n times.
2. Bounding the distance between two consecutive iterations: the iterative algorithm stops when the distance between the outputs of two consecutive iterations is less than a threshold.
The iMapReduce runtime system performs the termination check after each iteration. To terminate after a fixed number of iterations, each persistent map/reduce task records its iteration number and terminates itself when that number exceeds the threshold. To bound the distance between the outputs of two consecutive iterations, the reduce tasks save the output of two consecutive iterations and compute the distance between them. If the termination condition is satisfied, the master notifies all map and reduce tasks to terminate their execution.
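As a concrete illustration of the second check, the sketch below keeps the reduce output of the previous iteration in memory and compares it against the current one. The class name, the L1 distance measure, and the double-valued records are assumptions chosen for the example, not iMapReduce's actual code.

import java.util.Map;

public class TerminationCheck {
    private Map<String, Double> previous;   // reduce output of iteration i - 1
    private final double threshold;
    private final int maxIterations;
    private int iteration = 0;

    public TerminationCheck(double threshold, int maxIterations) {
        this.threshold = threshold;
        this.maxIterations = maxIterations;
    }

    /** Returns true if the iterative job should stop. */
    public boolean shouldStop(Map<String, Double> current) {
        iteration++;
        if (iteration >= maxIterations) return true;  // fixed-iteration bound
        if (previous != null) {
            double distance = 0.0;  // L1 distance between consecutive iterations
            for (Map.Entry<String, Double> e : current.entrySet()) {
                double old = previous.getOrDefault(e.getKey(), 0.0);
                distance += Math.abs(e.getValue() - old);
            }
            if (distance < threshold) return true;    // converged
        }
        previous = current;  // save this iteration's output for the next check
        return false;
    }
}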
Other projects have also been implemented to support iterative processing on top of the MapReduce framework. For example, Twister* is a MapReduce runtime with an extended programming model that supports iterative MapReduce computations.
* http://www.iterativemapreduce.org/.