It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some
of the split would have to be transferred across the network to the node running the map
task, which is clearly less efficient than running the whole map task using local data.
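The relationship between split size and block size can be sketched with the rule Hadoop's FileInputFormat uses, max(minimumSize, min(maximumSize, blockSize)); with the default minimum and maximum, the split size comes out equal to the block size. The function below is an illustrative Python sketch of that rule, not Hadoop's actual Java code:

```python
def split_size(block_size, min_size=1, max_size=float("inf")):
    """Sketch of FileInputFormat's split-size rule:
    max(minimumSize, min(maximumSize, blockSize)).
    With the default min/max, the split size equals the block size,
    so each split fits on a single node."""
    return max(min_size, min(max_size, block_size))

BLOCK = 128 * 1024 * 1024  # a 128 MB HDFS block

# Defaults: the split is exactly one block, preserving data locality.
assert split_size(BLOCK) == BLOCK

# Raising the minimum forces splits larger than a block, which would
# span blocks and lose locality.
assert split_size(BLOCK, min_size=2 * BLOCK) == 2 * BLOCK
```

Forcing a split larger than a block (as in the last line) is exactly the case the text warns about: the extra block would usually live on a different node and have to cross the network.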
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
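The rerun-on-failure strategy can be sketched in a few lines. Everything here is illustrative (the function and node names are hypothetical, not Hadoop APIs): the point is that the framework regenerates lost intermediate output by re-executing the task elsewhere, rather than paying for replicated storage up front:

```python
def run_map_with_retry(map_task, nodes):
    """Illustrative sketch: if a node fails before its map output is
    consumed, rerun the map task on another node to re-create the
    output from the (replicated) input split."""
    last_error = None
    for node in nodes:
        try:
            return map_task(node)      # map output regenerated here
        except RuntimeError as error:  # treat as a node failure
            last_error = error         # fall through to the next node
    raise last_error                   # no node could run the task

# Simulate a node that dies before its output is consumed:
failed_nodes = {"node1"}

def word_count_map(node):
    if node in failed_nodes:
        raise RuntimeError(node + " failed")
    return [("apple", 1), ("banana", 1)]

output = run_map_with_retry(word_count_map, ["node1", "node2"])
assert output == [("apple", 1), ("banana", 1)]
```

The re-execution is cheap precisely because the map's *input* lives in HDFS with replication, so any node holding a replica can redo the work.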
Figure 2-2. Data-local (a), rack-local (b), and off-rack (c) map tasks
Reduce tasks don't have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes for reliability. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
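The merge step on the reduce side can be sketched with Python's standard library: each mapper's output arrives sorted by key, the streams are merged into one sorted sequence, and the user-defined reduce function (summing counts, in this word-count-style example) is applied to each key's group. The data and reduce function are invented for illustration:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper emits (key, value) pairs sorted by key.
map_outputs = [
    [("apple", 1), ("banana", 2)],
    [("apple", 3), ("cherry", 1)],
    [("banana", 1), ("cherry", 4)],
]

# The reduce side merges the sorted streams into one sorted stream...
merged = heapq.merge(*map_outputs, key=itemgetter(0))

# ...then feeds each key's group to the user-defined reduce function.
result = {key: sum(value for _, value in group)
          for key, group in groupby(merged, key=itemgetter(0))}

assert result == {"apple": 4, "banana": 3, "cherry": 5}
```

Because every stream is already sorted, the merge is a cheap streaming operation; this is why the map side sorts its output before the transfer.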
The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted boxes indicate nodes, the dotted arrows show data transfers within a node, and the solid arrows show data transfers between nodes.