jobs and orchestrating their execution using a driver program. In practice, there are
two key problems with manually orchestrating an iterative program in MapReduce:
1. Even though much of the data may be unchanged from iteration to iteration, the data must be reloaded and reprocessed at each iteration, wasting I/O, network bandwidth, and CPU resources.
2. The termination condition may involve detecting whether a fixpoint has been reached. This check may itself require an extra MapReduce job on each iteration, again incurring overhead from scheduling extra tasks, reading extra data from disk, and moving data across the network.
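Both problems show up in a typical hand-written driver program. The following is a minimal sketch of such a driver on plain Hadoop MapReduce; RankMapper, RankReducer, the output paths, and the "convergence" counter are hypothetical placeholders rather than part of any particular system. It illustrates how the whole input, including the invariant part, is reloaded and reshuffled on every iteration, and how fixpoint detection needs extra machinery on top of each job.

// Minimal sketch of a manually orchestrated iterative driver on plain Hadoop.
// RankMapper, RankReducer, and the convergence counter are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String input = args[0];      // invariant structure plus the current state
    boolean converged = false;
    int iteration = 0;

    while (!converged && iteration < 50) {
      Job job = Job.getInstance(conf, "iteration-" + iteration);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(RankMapper.class);    // hypothetical mapper
      job.setReducerClass(RankReducer.class);  // hypothetical reducer
      // Problem 1: the whole data set, including the part that never changes,
      // is read from the distributed file system and reshuffled every time.
      FileInputFormat.addInputPath(job, new Path(input));
      FileOutputFormat.setOutputPath(job, new Path("out/iter-" + iteration));
      job.waitForCompletion(true);

      // Problem 2: fixpoint detection. Here it is folded into a counter kept
      // by the hypothetical reducer; in general it may require yet another
      // MapReduce job that compares the outputs of two consecutive iterations.
      long changed = job.getCounters()
                        .findCounter("convergence", "CHANGED_RECORDS").getValue();
      converged = (changed == 0);
      input = "out/iter-" + iteration;  // feed this iteration's output forward
      iteration++;
    }
  }
}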
The HaLoop system [25,26] is designed to support iterative processing by extending the basic MapReduce framework with two main functionalities:
1. Caching the invariant data in the first iteration and then reusing them in
later iterations.
2. Caching the reducer outputs, which makes checking for a fixpoint more
efficient, without an extra MapReduce job.
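The effect of the second point can be seen in a small, self-contained sketch. This is not HaLoop's actual API; it only illustrates why keeping the previous iteration's reducer output cached on the same node turns the fixpoint test into a cheap local comparison instead of a separate MapReduce job over two output snapshots.

// Conceptual sketch (not HaLoop's API): with the previous reducer output
// cached locally, the change between iterations can be measured in place.
import java.util.HashMap;
import java.util.Map;

public class FixpointCheck {
  // Cache of the previous iteration's reducer output, kept on the slave node.
  private final Map<String, Double> previousOutput = new HashMap<>();

  /** Returns the total change since the last iteration; 0.0 means a fixpoint. */
  public double measureChange(Map<String, Double> currentOutput) {
    double delta = 0.0;
    for (Map.Entry<String, Double> e : currentOutput.entrySet()) {
      double before = previousOutput.getOrDefault(e.getKey(), 0.0);
      delta += Math.abs(e.getValue() - before);
    }
    // Refresh the cache so the next iteration is compared the same way.
    previousOutput.clear();
    previousOutput.putAll(currentOutput);
    return delta;
  }

  public static void main(String[] args) {
    FixpointCheck check = new FixpointCheck();
    System.out.println(check.measureChange(Map.of("a", 1.0, "b", 2.0))); // 3.0
    System.out.println(check.measureChange(Map.of("a", 1.0, "b", 2.0))); // 0.0, fixpoint reached
  }
}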
Figure 2.6 illustrates the architecture of HaLoop as a modified version of the basic MapReduce framework. To accommodate the requirements of iterative data processing, HaLoop keeps Hadoop's master/slave structure, modifies the task scheduler and the task tracker, and introduces new loop control, caching, and indexing modules.
[Figure 2.6 shows a master node holding the task scheduler, loop control, and task queue, coordinating slave nodes that run a task tracker with caching and indexing modules over the local file system and the distributed file system; its legend marks each component as identical to Hadoop, modified from Hadoop, or new in HaLoop.]
FIGURE 2.6
An overview of HaLoop architecture. (From Y. Bu et al., PVLDB, 3(1), 285-296, 2010.)