MapReduce - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Figure 2-3. MapReduce data flow with a single reduce task

The number of reduce tasks is not governed by the size of the input, but instead is speci-

fied independently. In The Default MapReduce Job , you will see how to choose the num-

ber of reduce tasks for a given job.

When there are multiple reducers, the map tasks partition their output, each creating one

partition for each reduce task. There can be many keys (and their associated values) in

each partition, but the records for any given key are all in a single partition. The partition-

ing can be controlled by a user-defined partitioning function, but normally the default par-

titioner — which buckets keys using a hash function — works very well.

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4 .

This diagram makes it clear why the data flow between map and reduce tasks is colloqui-

ally known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is

more complicated than this diagram suggests, and tuning it can have a big impact on job

execution time, as you will see in Shuffle and Sort .

Search WWH ::

Custom Search

Home