Additional Considerations in Structuring a MapReduce Job
The preceding discussion presented the basics of structuring and running a
MapReduce job on a Hadoop cluster. Several Hadoop features provide additional
functionality to a MapReduce job.
First, a combiner is a useful option to apply, when possible, between the map task
and the shuffle and sort. Typically, the combiner applies the same logic used in the
reducer, but applies it to the output of each individual map task. In the word
count example, a combiner sums the occurrences of each word in a single mapper's
output. Figure 10.4 illustrates how a combiner processes a single string in the
simple word count example.
Figure 10.4 Using a combiner
Thus, in a production setting, instead of ten thousand possible <the, 1> key/
value pairs being emitted from the map task to the shuffle and sort, the combiner
emits a single <the, 10000> key/value pair. The reduce step still obtains a list
of values for each word, but instead of receiving a list of up to a million ones,
list(1, 1, ..., 1), for a key, the reduce step obtains a list such as
list(10000, 964, ..., 8345), which might be as long as the number of map
tasks that were run. The use of a combiner minimizes the amount of intermediate
map output that the reducer must store, transfer over the network, and process.
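To make this concrete, the following is a minimal sketch of a word count combiner written against the Hadoop Java MapReduce API. The class name WordCountCombiner is illustrative; in practice, when the combiner logic is identical to the reducer logic, the reducer class itself is commonly registered as the combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner for word count: sums the counts emitted by a single map task,
// so ten thousand <the, 1> pairs collapse into one <the, 10000> pair
// before the shuffle and sort.
public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}

The combiner is attached to the job with job.setCombinerClass(WordCountCombiner.class). Because Hadoop may run the combiner zero, one, or several times on a map task's output, the combiner's logic must be commutative and associative, as summation is here.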
Another useful option is the partitioner. It determines which reducer receives a
particular key and its corresponding list of values. Using the simple word count
example, Figure 10.5 shows how a partitioner can send every word that begins with
a vowel to one reducer and every word that begins with a consonant to another reducer.
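A minimal sketch of such a partitioner, again using the Hadoop Java API, follows. The class name VowelConsonantPartitioner is illustrative, and the sketch assumes the job is configured with two reduce tasks, for example via job.setNumReduceTasks(2).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words beginning with a vowel to reducer 0 and all other
// words to reducer 1 (assumes the job runs with two reduce tasks).
public class VowelConsonantPartitioner
        extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text word, IntWritable count,
                            int numReduceTasks) {
        String w = word.toString();
        if (numReduceTasks < 2 || w.isEmpty()) {
            return 0; // single reducer, or no first character to inspect
        }
        char first = Character.toLowerCase(w.charAt(0));
        return "aeiou".indexOf(first) >= 0 ? 0 : 1;
    }
}

The partitioner is registered with job.setPartitionerClass(VowelConsonantPartitioner.class). Note that getPartition must return a value between 0 and numReduceTasks - 1; every key/value pair with the same key is routed to the same reducer.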