Additional Considerations in Structuring a MapReduce Job
The preceding discussion presented the basics of structuring and running a
MapReduce job on a Hadoop cluster. Several Hadoop features provide additional
functionality to a MapReduce job.
First, a combiner is a useful option to apply, when possible, between the map task
and the shuffle and sort. Typically, the combiner applies the same logic used in the
reducer, but applies it to the output of each individual map task. In the word
count example, a combiner sums the occurrences of each word in a single mapper's
output. Figure 10.4 illustrates how a combiner processes a single string in the
simple word count example.
Figure 10.4 Using a combiner
Thus, in a production setting, instead of ten thousand possible <the, 1> key/
value pairs being emitted from the map task to the shuffle and sort, the combiner
emits a single <the, 10000> key/value pair. The reduce step still obtains a list
of values for each word, but instead of receiving a list of up to a million ones,
list(1, 1, ..., 1), for a key, the reduce step obtains a list such as
list(10000, 964, ..., 8345), which might be as long as the number of map
tasks that were run. The use of a combiner minimizes the amount of intermediate
map output that the reducer must store, transfer over the network, and process.
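To make this concrete, the following is a minimal sketch of a word count combiner written against the Hadoop Java MapReduce API. The class name WordCountCombiner is illustrative; in practice, when the combiner logic is identical to the reducer logic, the reducer class itself is commonly registered as the combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner for word count: sums the counts emitted by a single map task,
// so ten thousand <the, 1> pairs collapse into one <the, 10000> pair
// before the shuffle and sort.
public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}

The combiner is attached to the job with job.setCombinerClass(WordCountCombiner.class). Because Hadoop may run the combiner zero, one, or several times on a map task's output, the combiner's logic must be commutative and associative, as summation is here.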
Another useful option is the partitioner. It determines which reducer receives a
particular key and its corresponding list of values. Using the simple word count
example, Figure 10.5 shows how a partitioner can send every word that begins with
a vowel to one reducer and every word that begins with a consonant to another reducer.
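A minimal sketch of such a partitioner, again using the Hadoop Java API, follows. The class name VowelConsonantPartitioner is illustrative, and the sketch assumes the job is configured with two reduce tasks, for example via job.setNumReduceTasks(2).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words beginning with a vowel to reducer 0 and all other
// words to reducer 1 (assumes the job runs with two reduce tasks).
public class VowelConsonantPartitioner
        extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text word, IntWritable count,
                            int numReduceTasks) {
        String w = word.toString();
        if (numReduceTasks < 2 || w.isEmpty()) {
            return 0; // single reducer, or no first character to inspect
        }
        char first = Character.toLowerCase(w.charAt(0));
        return "aeiou".indexOf(first) >= 0 ? 0 : 1;
    }
}

The partitioner is registered with job.setPartitionerClass(VowelConsonantPartitioner.class). Note that getPartition must return a value between 0 and numReduceTasks - 1; every key/value pair with the same key is routed to the same reducer.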