architecture. The task tracker not only manages task execution but also manages
caches and indices on the slave node, and redirects each task's cache and index
accesses to the local file system.
In the MapReduce framework, each map or reduce task contains its portion of
the input data, and the task runs by applying the map/reduce function to its input
data records; the life cycle of the task ends when the processing of all of its input
records is complete. The iMapReduce framework [ 240 ]
supports the feature of iterative processing by keeping alive each map and reduce
task during the whole iterative process. In particular, when all of the input data of a
persistent task are parsed and processed, the task becomes dormant, waiting for the
new updated input data. A map task waits for the results from the reduce tasks
and is activated to work on the new input records when the required data from the
reduce tasks arrive. The reduce tasks wait for the map tasks' output and are
activated synchronously, as in MapReduce. Jobs can terminate their iterative process
in one of two ways:
1. Defining a fixed number of iterations: the iterative algorithm stops after it iterates n
times.
2. Bounding the distance between two consecutive iterations: the iterative algorithm
stops when the distance between the outputs of two consecutive iterations is less
than a specified threshold.
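The iterative model and the two termination conditions above can be sketched as a simple driver loop. This is an illustrative sketch, not the iMapReduce API: the function names, the dictionary-based state representation, and the Euclidean distance metric are assumptions chosen for clarity.

```python
def euclidean_distance(old, new):
    """L2 distance between two iterations' outputs (dicts key -> value)."""
    keys = set(old) | set(new)
    return sum((old.get(k, 0.0) - new.get(k, 0.0)) ** 2 for k in keys) ** 0.5

def run_iterative_job(map_fn, reduce_fn, state, max_iters=10, epsilon=1e-3):
    """Iterate map/reduce over `state` until the fixed iteration bound is
    reached (condition 1) or two consecutive outputs are closer than
    `epsilon` (condition 2)."""
    previous = None
    for iteration in range(1, max_iters + 1):
        # Map phase: emit (key, value) pairs from the current state.
        intermediate = {}
        for key, value in state.items():
            for k, v in map_fn(key, value):
                intermediate.setdefault(k, []).append(v)
        # Reduce phase: aggregate values per key into the next state.
        state = {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
        # Condition 2: bounded distance between consecutive iterations.
        if previous is not None and euclidean_distance(previous, state) < epsilon:
            break
        previous = dict(state)
    # Condition 1 is the loop bound itself (max_iters).
    return state, iteration
```

For example, a job whose map step halves each value converges geometrically and stops once consecutive outputs differ by less than the threshold, rather than exhausting the iteration budget.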
The iMapReduce runtime system performs the termination check after each iteration.
To terminate after a fixed number of iterations, each persistent map/reduce
task records its iteration number and terminates itself when that number exceeds the
specified limit. To bound the distance between the outputs of two consecutive iterations,
the reduce tasks can save the output of two consecutive iterations and compute
the distance between them. If the termination condition is satisfied, the master notifies
all the map and reduce tasks to terminate their execution.
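The per-task bookkeeping described above can be sketched as follows. The class and method names are hypothetical (iMapReduce is Java-based and its actual interfaces differ); the sketch only shows the state a persistent reduce task would keep: its iteration counter and the previous iteration's output, from which the distance check is computed locally.

```python
class PersistentReduceTask:
    """Sketch of a persistent reduce task's termination bookkeeping."""

    def __init__(self, max_iters=None, epsilon=None):
        self.max_iters = max_iters    # fixed-iteration bound (condition 1)
        self.epsilon = epsilon        # distance threshold (condition 2)
        self.iteration = 0
        self.previous_output = None   # saved output of the prior iteration

    def finish_iteration(self, output):
        """Called once per iteration with this task's reduce output
        (a dict key -> float). Returns True if the task should signal
        the master that the job can terminate."""
        self.iteration += 1
        terminate = False
        # Condition 1: the task records its own iteration number.
        if self.max_iters is not None and self.iteration >= self.max_iters:
            terminate = True
        # Condition 2: distance between two consecutive saved outputs.
        if self.epsilon is not None and self.previous_output is not None:
            keys = set(output) | set(self.previous_output)
            dist = sum((output.get(k, 0.0) - self.previous_output.get(k, 0.0)) ** 2
                       for k in keys) ** 0.5
            if dist < self.epsilon:
                terminate = True
        self.previous_output = dict(output)
        return terminate
```

In the real system the master, not the task, makes the final call: it collects these signals and then notifies all map and reduce tasks to stop.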
Other projects have also been implemented to support iterative processing on the
MapReduce framework. For example, Twister [ 50 ] is a MapReduce runtime with
an extended programming model that supports iterative MapReduce computations
efficiently [ 125 ]. It uses a publish/subscribe messaging infrastructure for communi-
cation and data transfers, and supports long running map/reduce tasks. In particular,
it provides programming extensions to MapReduce with broadcast and scatter type
data transfers. Microsoft has also developed a project that provides an iterative
MapReduce runtime for Windows Azure called Daytona [ 37 ].
Data and Process Sharing
With the emergence of cloud computing, the use of an analytical query processing
infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value.
Since different MapReduce jobs can perform similar work, there can be many
opportunities for sharing the execution of their work. Such sharing reduces the
overall amount of work, which consequently leads to lower monetary costs.