How MapReduce Works - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Shuffle and Sort

MapReduce makes the guarantee that the input to every reducer is sorted by key. The pro-

cess by which the system performs the sort — and transfers the map outputs to the reducers

as inputs — is known as the shuffle. [ 54 ] In this section, we look at how the shuffle works, as

a basic understanding will be helpful should you need to optimize a MapReduce program.

The shuffle is an area of the codebase where refinements and improvements are continually

being made, so the following description necessarily conceals many details. In many ways,

the shuffle is the heart of MapReduce and is where the “magic” happens.

The Map Side

When the map function starts producing output, it is not simply written to disk. The process

is more involved, and takes advantage of buffering writes in memory and doing some pre-

sorting for efficiency reasons. Figure 7-4 shows what happens.

Figure 7-4. Shuffle and sort in MapReduce

Each map task has a circular memory buffer that it writes the output to. The buffer is 100

MB by default (the size can be tuned by changing the mapreduce.task.io.sort.mb

property). When the contents of the buffer reach a certain threshold size ( mapre-

duce.map.sort.spill.percent , which has the default value 0.80, or 80%), a

background thread will start to spill the contents to disk. Map outputs will continue to be

written to the buffer while the spill takes place, but if the buffer fills up during this time,

the map will block until the spill is complete. Spills are written in round-robin fashion to

Search WWH ::

Custom Search

Home