FIGURE 17.1
Components in map and reduce tasks and the sequence of execution.
[Figure not reproduced. Labels in the map task: Read, Map, Partition/sort, Combine, with Block, HDFS, and Local disk. Labels in the reduce task: Pull data, Copy, Sort, Reduce, WriteBack, with HDFS File.]
helpful to set R greater than r because there is no restriction on the amount of data a
reduce process can handle. As a rule of thumb, when the number of map output keys
is much larger than r, R is often set close to the number of all available reduce slots for
an in-house cluster, for example, 95% of all reduce slots [22]. When it comes to public
clouds, we will set R = r and choose an appropriate number of reduce slots, r, to find
the best tradeoff between the time and the financial cost.
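The rule of thumb above can be sketched as a small helper. This is only an illustration: the function name, the `in_house` flag, and the `slot_fraction` parameter are hypothetical, and the 95% default is taken from the example in the text rather than from any fixed requirement. The caller is assumed to have already checked that the number of map output keys is much larger than r.

```python
def choose_num_reduce_processes(reduce_slots: int,
                                in_house: bool,
                                slot_fraction: float = 0.95) -> int:
    """Pick R, the number of reduce processes, from r, the number of reduce slots.

    Hypothetical helper: in_house=True applies the "close to all available
    reduce slots" rule of thumb (95% by default); otherwise R = r, as the
    text does for public clouds.
    """
    if reduce_slots <= 0:
        raise ValueError("reduce_slots must be positive")
    if not 0.0 < slot_fraction <= 1.0:
        raise ValueError("slot_fraction must be in (0, 1]")
    if in_house:
        # Keep at least one reduce process even on a very small cluster.
        return max(1, round(reduce_slots * slot_fraction))
    # Public cloud: one reduce process per provisioned reduce slot.
    return reduce_slots
```

With 100 reduce slots, the in-house rule gives R = 95, while the public-cloud setting gives R = 100.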
Figure 17.2 illustrates the scheduling of map and reduce processes to the map
and reduce slots in the ideal situation. In practice, map processes in the same round
may not finish at exactly the same time; some may finish earlier or later than others
due to the system configuration, the disk I/O, the network traffic, and the data
distribution. However, we can use the total number of rounds to roughly estimate
the total time spent in the map phase. The variance caused by all these factors will
be considered in modeling. Intuitively, the more available slots, the faster the whole
MapReduce job can be finished. However, in the pay-as-you-go setting, there is
a tradeoff between the amount of resources and the amount of time to finish the
MapReduce job. Thus, we cannot simply increase the amount of resources.
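As a rough illustration of the round-based estimate and of the time versus resource tradeoff, the sketch below counts the rounds needed to run M map processes on m map slots and multiplies by an average per-process time. The function names, the `avg_map_seconds` parameter, and the use of slot-seconds as a stand-in for financial cost are assumptions made for this example; they are not the cost model derived later in the chapter.

```python
def map_rounds(num_map_processes: int, map_slots: int) -> int:
    """Number of rounds needed to run M map processes on m slots: ceil(M / m)."""
    if num_map_processes < 0:
        raise ValueError("num_map_processes must be non-negative")
    if map_slots <= 0:
        raise ValueError("map_slots must be positive")
    # Integer ceiling division avoids floating-point rounding for large M.
    return -(-num_map_processes // map_slots)


def estimate_map_phase_seconds(num_map_processes: int,
                               map_slots: int,
                               avg_map_seconds: float) -> float:
    """Ideal-case estimate: every round takes the same (assumed) average time."""
    return map_rounds(num_map_processes, map_slots) * avg_map_seconds


def slot_seconds(num_map_processes: int,
                 map_slots: int,
                 avg_map_seconds: float) -> float:
    """Assumed cost proxy: slots held for the whole map phase, in slot-seconds."""
    phase = estimate_map_phase_seconds(num_map_processes, map_slots, avg_map_seconds)
    return map_slots * phase


if __name__ == "__main__":
    M, avg = 100, 30.0  # hypothetical workload: 100 map processes, 30 s each
    for m in (4, 8, 16, 32):
        print(m, map_rounds(M, m),
              estimate_map_phase_seconds(M, m, avg),
              slot_seconds(M, m, avg))
```

Under these assumed numbers, going from 4 to 32 slots cuts the estimated map phase from 750 s to 120 s, but the slot-seconds consumed rise from 3000 to 3840, because the last round leaves some slots idle. This is the kind of tradeoff that makes simply adding resources unattractive in a pay-as-you-go setting.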
In addition to the cost of map and reduce processes, the system has some additional
cost for managing and scheduling the M map processes and the R reduce processes,
which will also be considered in modeling. Based on this understanding, we
will first analyze the cost of each map process and reduce process, respectively, and
then derive the overall cost model.
FIGURE 17.2
Illustration of parallel and sequential execution in the ideal situation.
[Figure not reproduced. Labels: [M/m] rounds of map processes; Map process; Intermediate results; Reduce process; Time axis.]