We take a combined approach instead. This method depends on a best-effort
analysis of the whole MapReduce processing framework, which results in the
following cost function:
T(S, D, C) = Σ_{i=1}^{k} β_i h_i(S, D, C) + β_0,    (17.1)
where the h_i(S, D, C) are possibly nonlinear transformations of the input factors S,
D, and C, capturing the time complexity of the sequential processing components in the
system, and the β_i are the component weights, which differ from application to
application. The h_i(S, D, C) are obtained through the analysis of the MapReduce
processing components, while the β_i are learned for a specific application based on
sample runs of that application. With this modeling idea in mind, in the following
subsections we conduct the modeling analysis, give a concrete formulation of the cost
functions of the map task and the reduce task to find the h_i(S, D, C), and finally
integrate these components into the whole cost function.
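Since the β_i are learned from sample runs while the h_i(S, D, C) are fixed by the component analysis, fitting Equation 17.1 reduces to ordinary linear regression. The following is a minimal sketch of that fitting step in Python, assuming hypothetical h_i values and measured costs from five sample runs; all numbers and variable names are illustrative, not from the text.

```python
import numpy as np

# Hypothetical design matrix: H[j][i] = h_i(S, D, C) evaluated on the j-th
# sample run of the application (k = 3 components, n = 5 runs).
H = np.array([
    [1.0,  2.0,  0.5],
    [2.0,  3.5,  1.0],
    [4.0,  5.0,  2.0],
    [8.0,  9.5,  4.0],
    [16.0, 18.0, 8.0],
])
# Measured total costs T(S, D, C) of the same runs (e.g., seconds).
T = np.array([3.1, 5.2, 8.9, 16.8, 33.0])

# Prepend a column of ones so the intercept beta_0 is fitted jointly
# with the component weights beta_1..beta_k.
X = np.hstack([np.ones((H.shape[0], 1)), H])
beta, *_ = np.linalg.lstsq(X, T, rcond=None)  # ordinary least squares
print("beta_0:", beta[0], "beta_1..k:", beta[1:])

# Predict the cost of an unseen run from its h_i values.
h_new = np.array([1.0, 6.0, 7.0, 3.0])  # [1, h_1, h_2, h_3]
print("predicted cost:", h_new @ beta)
```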
17.3.1 Analyzing the Process of MapReduce
MapReduce processing is a mix of sequential and parallel processing. The map
phase is executed before the reduce phase,* as Figure 17.1 shows. However, in each
phase many map or reduce processes are executed in parallel. To describe the
MapReduce execution clearly, we distinguish the concepts of Map/Reduce slot and
Map/Reduce process. Each map (or reduce) process is executed in a map (or
reduce) slot, where a slot is a unit of computing resources allocated for the
corresponding process. Based on its capacity, a computing node can accommodate
only a fixed number of slots, so that the parallel processes can run in their
slots without serious resource contention. In Hadoop, the TaskTracker running on
each slave node sets the number of map slots and the number of reduce slots. A
common setting for a multicore computer is two map slots and two reduce slots per
core. Without loss of generality, let us assume there are m map slots and r
reduce slots in total over all slave nodes.
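To make the slot accounting concrete, the small sketch below computes the totals m and r for a hypothetical homogeneous cluster under the per-core rule of thumb above. The Hadoop 1.x property names in the comments are the actual configuration knobs, but every number is an assumption.

```python
# Per-node slot counts; in Hadoop 1.x these are set per TaskTracker via
# mapred.tasktracker.map.tasks.maximum and
# mapred.tasktracker.reduce.tasks.maximum.
num_slave_nodes = 10   # assumed cluster size
cores_per_node = 4     # assumed hardware
slots_per_core = 2     # "two map slots and two reduce slots per core"

map_slots_per_node = cores_per_node * slots_per_core
reduce_slots_per_node = cores_per_node * slots_per_core

m = num_slave_nodes * map_slots_per_node      # total map slots
r = num_slave_nodes * reduce_slots_per_node   # total reduce slots
print(m, r)  # 80 80
```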
We define a map/reduce process as a map/reduce task running on a specific slot.
By default, in Hadoop each map process handles one chunk of data (e.g., 64 MB).
Therefore, if there are M chunks of data, M map processes in total will be scheduled
and assigned to the m slots. In the ideal case, m map processes can run in parallel in
the m slots; we call this one round of map processes. If M > m, which is normal for
large data sets, ⌈M/m⌉ rounds of map processes are needed.
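As a quick illustration of the round computation, here is a sketch assuming the default 64 MB chunk size and the hypothetical 80-slot cluster above:

```python
import math

data_size_mb = 100 * 1024  # assumed 100 GB input
chunk_mb = 64              # default chunk handled by one map process
m = 80                     # total map slots (hypothetical, from above)

M = math.ceil(data_size_mb / chunk_mb)  # chunks = map processes to schedule
map_rounds = math.ceil(M / m)           # rounds needed when M > m
print(M, map_rounds)                    # 1600 chunks -> 20 map rounds
```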
Different from the total number of map processes, the number of reduce pro-
cesses, denoted as R, can be set by the user or determined by specific application
requirements. The map outputs, that is, the key-value pairs, are organized by their
keys and then distributed evenly by key to the R reduce processes. Similarly, if
R > r, more than one round of reduce processes is scheduled. It is probably not very
meaningful to set R greater than the number of output keys of the map phase.
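The even distribution by key can be pictured as hash partitioning, in the spirit of Hadoop's default HashPartitioner (the key's hash modulo R); the sketch below, with assumed values for R and r, also shows the resulting number of reduce rounds. Python's built-in hash merely stands in for Hadoop's key hashing.

```python
import math

R = 12  # reduce processes, set by the user (hypothetical)
r = 8   # reduce slots available in the cluster (hypothetical)

def partition(key: str, num_reduces: int) -> int:
    # Assign a key to one of the R reduce processes by hashing, as
    # Hadoop's HashPartitioner does (illustrative stand-in, not its code).
    return hash(key) % num_reduces

print(partition("apple", R), partition("banana", R))
print(math.ceil(R / r))  # 2 rounds of reduce processes
```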
* The copy operation in the reduce phase overlaps the map phase: when a map's result is ready, the copy may start immediately.