starting the iterative process from scratch. Since the state data are relatively small, dumping them to DFS (i.e., making several copies on several other machines for data redundancy) is expected to take little time. Note that the checkpointing process is performed in parallel with the iterative processing.
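Although the text does not show code for this step, the sketch below illustrates one way a task could dump a snapshot of its state data to DFS from a background thread so that checkpointing overlaps the next iteration. The class name StateCheckpointer, the Map<Long, Double> state representation, and the file layout are illustrative assumptions; only the Hadoop FileSystem calls are standard API.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StateCheckpointer {
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    private final FileSystem fs;
    private final Path checkpointDir;

    public StateCheckpointer(Configuration conf, Path checkpointDir) throws IOException {
        this.fs = FileSystem.get(conf);      // handle to the underlying DFS
        this.checkpointDir = checkpointDir;
    }

    // Snapshot the (small) state data and write it to DFS in the background,
    // so the next iteration is not blocked by checkpointing.
    public void checkpoint(int taskId, int iteration, Map<Long, Double> state) {
        final Map<Long, Double> snapshot = Map.copyOf(state);
        background.submit(() -> {
            Path file = new Path(checkpointDir, "task-" + taskId + "-iter-" + iteration);
            try (FSDataOutputStream out = fs.create(file, true)) {  // DFS replicates the file
                for (Map.Entry<Long, Double> e : snapshot.entrySet()) {
                    out.writeLong(e.getKey());
                    out.writeDouble(e.getValue());
                }
            } catch (IOException ex) {
                // a failed checkpoint is not fatal; the previous checkpoint remains usable
            }
            return null;
        });
    }
}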
3.3.5 Load Balancing
In MapReduce, the master decomposes a submitted job into multiple tasks. A slave worker completes one task and then requests another one from the master. This “complete-and-then-feed” task scheduling mechanism makes good use of computing resources. In iMapReduce, by contrast, all tasks are assigned to slave workers at once in the beginning, since tasks are persistent in iMapReduce. This one-time assignment conflicts with MapReduce's task scheduling strategy, so iMapReduce cannot obtain this benefit of the original MapReduce framework.
Lack of load balancing support may lead to several problems: (1) Even though the initial input data are evenly partitioned among workers, the computation workload is not necessarily evenly distributed, due to the skewed degree distribution. (2) Even if the computation workload is evenly distributed among workers, the best utilization of computing resources still cannot be guaranteed, since a large cluster might consist of heterogeneous servers [19].
To address these problems, iMapReduce performs task migration when the work-
load is unbalanced among workers. After each iteration, each reduce task sends an
iteration completion report to the master, which contains the reduce task id, the itera-
tion number, and the processing time of that iteration. Upon receipt of all the reduce
tasks' reports, the master calculates the average processing time. Based on the aver-
age time, the master calculates the time deviation of each worker and identifies the
slower workers and the faster workers. If the time deviation is larger than a threshold,
an MRPair in the slowest worker is migrated to the fastest worker in the following
three steps. The master (1) kills an MRPair in the slow worker, (2) launches a new
MRPair in the fast worker, and (3) sends a rollback command to the other map/
reduce tasks. All the map/reduce tasks that receive the rollback command skip their
current processing. The rolled back map tasks reload the latest checkpointed state
data from DFS and proceed. The newly launched map tasks load not only the state data but also the corresponding structure data from DFS.
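A minimal sketch of this master-side decision is given below, assuming an IterationReport structure with the fields named in the text and a 25% deviation threshold; these names, the threshold value, and the migrateMRPair hook are illustrative assumptions rather than the actual iMapReduce implementation.

import java.util.List;

public class LoadBalancer {
    // Relative deviation from the average iteration time that triggers
    // migration (assumed value, for illustration only).
    private static final double THRESHOLD = 0.25;

    // One iteration completion report sent by a reduce task to the master.
    public static class IterationReport {
        final int taskId;
        final int iteration;
        final long processingTimeMs;
        final String worker;

        IterationReport(int taskId, int iteration, long processingTimeMs, String worker) {
            this.taskId = taskId;
            this.iteration = iteration;
            this.processingTimeMs = processingTimeMs;
            this.worker = worker;
        }
    }

    // Called once the master has received the reports of all reduce tasks.
    public void balance(List<IterationReport> reports) {
        double avg = reports.stream().mapToLong(r -> r.processingTimeMs).average().orElse(0);
        IterationReport slowest = reports.stream()
                .max((a, b) -> Long.compare(a.processingTimeMs, b.processingTimeMs)).orElseThrow();
        IterationReport fastest = reports.stream()
                .min((a, b) -> Long.compare(a.processingTimeMs, b.processingTimeMs)).orElseThrow();

        // Migrate only when the slowest worker deviates from the average by
        // more than the threshold.
        if (avg > 0 && (slowest.processingTimeMs - avg) / avg > THRESHOLD) {
            migrateMRPair(slowest.worker, fastest.worker);
        }
    }

    private void migrateMRPair(String slowWorker, String fastWorker) {
        // (1) kill an MRPair on the slow worker, (2) launch a new MRPair on the
        // fast worker, and (3) broadcast a rollback command so the other tasks
        // reload the latest checkpointed state data from DFS. Left abstract here.
    }
}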
3.4 ALGORITHM IMPLEMENTATION
iMapReduce supports the implementation of iterative algorithms and is compatible
with traditional Hadoop MapReduce. Users can turn on iterative processing func-
tionalities for implementing iterative algorithms or turn them off for implementing
Hadoop MapReduce jobs as usual.
3.4.1 iMapReduce API
To implement an iterative algorithm in iMapReduce, users should implement the following interfaces, which are slightly changed from the Hadoop API (written in Java).
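A rough sketch of how such an iterative map interface could look is shown below. It only illustrates the general idea that the per-iteration state data and the static structure data are passed to map() separately; the names and signatures are assumptions for illustration, not the actual iMapReduce API.

import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical iterative map interface: unlike a standard Hadoop Mapper,
// map() receives the joined state data (updated every iteration) and the
// structure data (loaded once and kept across iterations) for a key.
public interface IterativeMapper<K extends WritableComparable<K>,
                                 S extends Writable,
                                 D extends Writable,
                                 V extends Writable> {
    void map(K key, S state, D structure,
             OutputCollector<K, V> output, Reporter reporter) throws IOException;
}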