Database Reference
In-Depth Information
User
program
(1) fork
(1) fork
(1) fork
Master
(2)
assign
map
(2)
assign
reduce
Worker
Split 0
Split 1
Split 2
Split 3
Split 4
(6) write
Output
file 0
Worker
(4) local write
(3) read
Worker
Output
file 1
Worker
Worker
Input files
Map phase
Intermediate files
(on local disk)
Reduce phase
Output
files
FIGURE 2.2
An overview of the flow of execution a MapReduce Operation. (From J. Dean
and S. Ghemawat, MapReduce: Simplified data processing on large clusters, in
OSDI
,
pp. 137-150, 2004.)
by other workers. Completed map tasks are re-executed on a task failure because
their output is stored on the local disk(s) of the failed machine and is therefore inac-
cessible. Completed reduce tasks do not need to be re-executed since their output is
stored in a global file system.
2.3 EXTENSIONS AND ENHANCEMENTS OF
THE MapReduce FRAMEWORK
In practice, the basic implementation of the MapReduce is very useful for handling
data processing and data loading in a heterogeneous system with many different stor-
age systems. Moreover, it provides a flexible framework for the execution of more
complicated functions than that can be directly supported in SQL. However, this
basic architecture suffers from some limitations. Dean and Ghemawa [45] reported
about some possible improvements that can be incorporated into the MapReduce
framework. Examples of these possible improvements include the following:
•
MapReduce should take advantage of natural indices whenever possible.
•
Most MapReduce output can be left unmerged since there is no benefit of
merging them if the next consumer is just another MapReduce program.
•
MapReduce users should avoid using inefficient textual formats.
Search WWH ::
Custom Search