Database Reference
In-Depth Information
Figure 2-3. MapReduce data flow with a single reduce task
The number of reduce tasks is not governed by the size of the input, but instead is speci-
fied independently. In The Default MapReduce Job , you will see how to choose the num-
ber of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys (and their associated values) in
each partition, but the records for any given key are all in a single partition. The partition-
ing can be controlled by a user-defined partitioning function, but normally the default par-
titioner β€” which buckets keys using a hash function β€” works very well.
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4 .
This diagram makes it clear why the data flow between map and reduce tasks is colloqui-
ally known as β€œthe shuffle,” as each reduce task is fed by many map tasks. The shuffle is
more complicated than this diagram suggests, and tuning it can have a big impact on job
execution time, as you will see in Shuffle and Sort .
Search WWH ::




Custom Search