It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some
of the split would have to be transferred across the network to the node running the map
task, which is clearly less efficient than running the whole map task using local data.
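The relationship between split size and block size can be sketched with the rule Hadoop's FileInputFormat uses, max(minimumSize, min(maximumSize, blockSize)); with the default minimum and maximum, the split size comes out equal to the block size. The function below is an illustrative Python sketch of that rule, not Hadoop's actual Java code:

```python
def split_size(block_size, min_size=1, max_size=float("inf")):
    """Sketch of FileInputFormat's split-size rule:
    max(minimumSize, min(maximumSize, blockSize)).
    With the default min/max, the split size equals the block size,
    so each split fits on a single node."""
    return max(min_size, min(max_size, block_size))

BLOCK = 128 * 1024 * 1024  # a 128 MB HDFS block

# Defaults: the split is exactly one block, preserving data locality.
assert split_size(BLOCK) == BLOCK

# Raising the minimum forces splits larger than a block, which would
# span blocks and lose locality.
assert split_size(BLOCK, min_size=2 * BLOCK) == 2 * BLOCK
```

Forcing a split larger than a block (as in the last line) is exactly the case the text warns about: the extra block would usually live on a different node and have to cross the network.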
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
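The rerun-on-failure strategy can be sketched in a few lines. Everything here is illustrative (the function and node names are hypothetical, not Hadoop APIs): the point is that the framework regenerates lost intermediate output by re-executing the task elsewhere, rather than paying for replicated storage up front:

```python
def run_map_with_retry(map_task, nodes):
    """Illustrative sketch: if a node fails before its map output is
    consumed, rerun the map task on another node to re-create the
    output from the (replicated) input split."""
    last_error = None
    for node in nodes:
        try:
            return map_task(node)      # map output regenerated here
        except RuntimeError as error:  # treat as a node failure
            last_error = error         # fall through to the next node
    raise last_error                   # no node could run the task

# Simulate a node that dies before its output is consumed:
failed_nodes = {"node1"}

def word_count_map(node):
    if node in failed_nodes:
        raise RuntimeError(node + " failed")
    return [("apple", 1), ("banana", 1)]

output = run_map_with_retry(word_count_map, ["node1", "node2"])
assert output == [("apple", 1), ("banana", 1)]
```

The re-execution is cheap precisely because the map's *input* lives in HDFS with replication, so any node holding a replica can redo the work.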
Figure 2-2. Data-local (a), rack-local (b), and off-rack (c) map tasks
Reduce tasks don't have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes for reliability. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
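The merge step on the reduce side can be sketched with Python's standard library: each mapper's output arrives sorted by key, the streams are merged into one sorted sequence, and the user-defined reduce function (summing counts, in this word-count-style example) is applied to each key's group. The data and reduce function are invented for illustration:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper emits (key, value) pairs sorted by key.
map_outputs = [
    [("apple", 1), ("banana", 2)],
    [("apple", 3), ("cherry", 1)],
    [("banana", 1), ("cherry", 4)],
]

# The reduce side merges the sorted streams into one sorted stream...
merged = heapq.merge(*map_outputs, key=itemgetter(0))

# ...then feeds each key's group to the user-defined reduce function.
result = {key: sum(value for _, value in group)
          for key, group in groupby(merged, key=itemgetter(0))}

assert result == {"apple": 4, "banana": 3, "cherry": 5}
```

Because every stream is already sorted, the merge is a cheap streaming operation; this is why the map side sorts its output before the transfer.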
The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted boxes indicate nodes, the dotted arrows show data transfers within a node, and the solid arrows show data transfers between nodes.