jobs and orchestrating their execution using a driver program. In practice, there are
two key problems with manually orchestrating an iterative program in MapReduce:
1. Even though much of the data may be unchanged from iteration to iteration, the data must be reloaded and reprocessed at each iteration, wasting I/O, network bandwidth, and CPU resources.
2. The termination condition may involve detecting whether a fixpoint has been reached. This check may itself require an extra MapReduce job on each iteration, again incurring overhead from scheduling extra tasks, reading extra data from disk, and moving data across the network.
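Both problems show up in a typical hand-written driver program. The following is a minimal sketch of such a driver on plain Hadoop MapReduce; RankMapper, RankReducer, the output paths, and the "convergence" counter are hypothetical placeholders rather than part of any particular system. It illustrates how the whole input, including the invariant part, is reloaded and reshuffled on every iteration, and how fixpoint detection needs extra machinery on top of each job.

// Minimal sketch of a manually orchestrated iterative driver on plain Hadoop.
// RankMapper, RankReducer, and the convergence counter are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String input = args[0];      // invariant structure plus the current state
    boolean converged = false;
    int iteration = 0;

    while (!converged && iteration < 50) {
      Job job = Job.getInstance(conf, "iteration-" + iteration);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(RankMapper.class);    // hypothetical mapper
      job.setReducerClass(RankReducer.class);  // hypothetical reducer
      // Problem 1: the whole data set, including the part that never changes,
      // is read from the distributed file system and reshuffled every time.
      FileInputFormat.addInputPath(job, new Path(input));
      FileOutputFormat.setOutputPath(job, new Path("out/iter-" + iteration));
      job.waitForCompletion(true);

      // Problem 2: fixpoint detection. Here it is folded into a counter kept
      // by the hypothetical reducer; in general it may require yet another
      // MapReduce job that compares the outputs of two consecutive iterations.
      long changed = job.getCounters()
                        .findCounter("convergence", "CHANGED_RECORDS").getValue();
      converged = (changed == 0);
      input = "out/iter-" + iteration;  // feed this iteration's output forward
      iteration++;
    }
  }
}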
The HaLoop system [25,26] is designed to support iterative processing by extending the basic MapReduce framework with two main functionalities:
1. Caching the invariant data in the first iteration and then reusing them in
later iterations.
2. Caching the reducer outputs, which makes checking for a fixpoint more
efficient, without an extra MapReduce job.
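The effect of the second point can be seen in a small, self-contained sketch. This is not HaLoop's actual API; it only illustrates why keeping the previous iteration's reducer output cached on the same node turns the fixpoint test into a cheap local comparison instead of a separate MapReduce job over two output snapshots.

// Conceptual sketch (not HaLoop's API): with the previous reducer output
// cached locally, the change between iterations can be measured in place.
import java.util.HashMap;
import java.util.Map;

public class FixpointCheck {
  // Cache of the previous iteration's reducer output, kept on the slave node.
  private final Map<String, Double> previousOutput = new HashMap<>();

  /** Returns the total change since the last iteration; 0.0 means a fixpoint. */
  public double measureChange(Map<String, Double> currentOutput) {
    double delta = 0.0;
    for (Map.Entry<String, Double> e : currentOutput.entrySet()) {
      double before = previousOutput.getOrDefault(e.getKey(), 0.0);
      delta += Math.abs(e.getValue() - before);
    }
    // Refresh the cache so the next iteration is compared the same way.
    previousOutput.clear();
    previousOutput.putAll(currentOutput);
    return delta;
  }

  public static void main(String[] args) {
    FixpointCheck check = new FixpointCheck();
    System.out.println(check.measureChange(Map.of("a", 1.0, "b", 2.0))); // 3.0
    System.out.println(check.measureChange(Map.of("a", 1.0, "b", 2.0))); // 0.0, fixpoint reached
  }
}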
Figure 2.6 illustrates the architecture of HaLoop as a modified version of the basic MapReduce framework. To accommodate the requirements of iterative data processing, HaLoop keeps Hadoop's master/slave structure, modifies the task scheduler and the task tracker, and introduces new loop control, caching, and indexing modules.
[Figure 2.6 shows a master node holding the task scheduler, loop control, and task queue, coordinating slave nodes that run a task tracker with caching and indexing modules over the local file system and the distributed file system; its legend marks each component as identical to Hadoop, modified from Hadoop, or new in HaLoop.]
FIGURE 2.6
An overview of HaLoop architecture. (From Y. Bu et al., PVLDB, 3(1), 285-296, 2010.)