iMapReduce - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

To implement persistent tasks, there should be enough available task slots . The

number of available map/reduce task slots is the number of map/reduce tasks that

the framework can accommodate (allows to be executed) simultaneously. In Hadoop

MapReduce, the master splits a job into many small map/reduce tasks, the number

of map/reduce tasks executed simultaneously cannot be larger than the number of

the available map/reduce task slots (the default number in Hadoop is 2 per slave

worker). Once a slave worker completes an assigned task, it requests another one

from the master. iMapReduce requires to guarantee that there are sufficient available

task slots for all the persistent tasks to start at the beginning. This means that the

task granularity should be set coarser to have fewer tasks. Clearly, this might make

load balancing challenging. iMapReduce addresses this issue with a load balancing

scheme, which will be introduced later.

3.3.2.2 Passing State Data from Reduce to Map

In MapReduce, the output of a reduce task is written to DFS and might be used later

in the next MapReduce job. In contrast, iMapReduce allows the state data to be

passed from the reduce task to the map task directly, so as to trigger the join opera-

tion with the structure data and to start the map execution for the next iteration. To

do so, iMapReduce builds persistent socket connections from the reduce tasks to the

map tasks.

3.3.2.3 Joining State Data with Structure Data

iMapReduce differentiates the structure data from the state data and processes them

separately. iMapReduce will combine them before using them in the map operation.

By specifying the partition of the structure data and customizing the shuffling of the

state data correspondingly, a reduce task's output state KVs and their corresponding

structure KVs are always in the same map task. In each iteration, before executing

the map operation, for each state KV, a join operation that extracts its corresponding

structure KV is performed. Then they are used together in the map operation.

iMapReduce achieves automatic join by leveraging sorted files. With the support

of MapReduce framework, the state KVs from reduce task to map task are sorted

in the order of their keys. Suppose a structure key is SK and its corresponding state

key is cord ( SK ), iMapReduce sorts the structure KVs of each structure data partition

in the order of cord ( SK ). Therefore, given the sorted state data flow, by sequentially

parsing the structure data partition file once, all the corresponding state KVs and

structure KVs are extracted. The joined state KV and structure KV are provided to

map operation as the input parameters. Users can concentrate on implementing the

map operation, without worrying about the maintenance of the structure data in their

iterative algorithm implementations.

3.3.2.4 Asynchronous Execution of Map Tasks

In MapReduce iterative algorithm implementations, there are two synchronization

barriers in each iteration; one is between mappers and reducers and another one

is between MapReduce jobs. Due to the synchronization barrier between jobs, the

map tasks of the next iteration job cannot start before the completion of the previ-

ous iteration job, which requires the completion of all reduce tasks. However, since

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home