We've seen how using the Aggregate package under Streaming is a simple way to
get some popular metrics. It's a great demonstration of Hadoop's power in simplifying
the analysis of large data sets.
4.6 Improving performance with combiners
We saw in AverageByAttributeMapper.py and AverageByAttributeReducer.py (listings 4.7 and 4.8) how to compute the average for each attribute. The mapper reads each record and outputs a key/value pair made up of the record's attribute and the value being averaged. The framework shuffles the key/value pairs across the network, and the reducer computes the average for each key. In our example of computing the average number of claims for each country's patents, we see at least two efficiency bottlenecks:
1. If we have 1 billion input records, the mappers will generate 1 billion key/value pairs that will be shuffled across the network. If we were computing a function such as maximum, it's obvious that the mapper only has to output the maximum for each key it has seen. Doing so would reduce network traffic and increase performance. For a function such as average, it's a bit more complicated, but we can still redefine the algorithm such that each mapper shuffles only one record per key.
2. Using country from the patent data set as key illustrates data skew. The data is far from uniformly distributed, as a significant majority of the records would have U.S. as the key. Not only does every key/value pair in the input map to a key/value pair in the intermediate data, but most of the intermediate key/value pairs will end up at a single reducer, overwhelming it.
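To make the starting point concrete, here is a minimal sketch of what a Streaming mapper and reducer of this shape could look like. It is written in the spirit of listings 4.7 and 4.8 rather than reproducing them; the comma-separated input format and the column positions of the country code and claim count are assumptions for illustration.

#!/usr/bin/env python
# Mapper sketch: emit one (country, claim count) pair per patent record.
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    country, claims = fields[4], fields[8]   # assumed column positions
    if claims.isdigit():                     # skip records without a claim count
        print('%s\t%s' % (country, claims))

#!/usr/bin/env python
# Reducer sketch: Streaming delivers lines sorted by key, so total and count
# the values for one key until the key changes, then emit that key's average.
import sys

last_key, total, count = None, 0.0, 0
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if last_key is not None and key != last_key:
        print('%s\t%f' % (last_key, total / count))
        total, count = 0.0, 0
    last_key = key
    total += float(value)
    count += 1
if last_key is not None:
    print('%s\t%f' % (last_key, total / count))

With one output pair per input record and the full fan-in landing on the reducers, both bottlenecks above are visible directly in this structure.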
Hadoop solves these bottlenecks by extending the MapReduce framework with a combiner step in between the mapper and reducer. You can think of the combiner as a helper
for the reducer. It's supposed to whittle down the output of the mapper to lessen the
load on the network and on the reducer. If we specify a combiner, the MapReduce
framework may apply it zero, one, or more times to the intermediate data. In order for
a combiner to work, it must be an equivalent transformation of the data with respect
to the reducer: that is, if we take out the combiner, the reducer's final output must remain the same.
Furthermore, the equivalent transformation property must hold when the combiner is
applied to arbitrary subsets of the intermediate data.
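The "zero, one, or more times" guarantee is easiest to picture with an operation such as summation, where pre-aggregating any subset of a key's values cannot change the final result. A toy check in plain Python (no Hadoop involved; the variable names are only for illustration):

values = [3, 1, 4, 1, 5, 9]              # all intermediate values for one key

no_combiner = sum(values)                                 # reducer sees the raw values
fully_combined = sum([sum(values[:2]), sum(values[2:])])  # combiner pre-summed two subsets
partially_combined = sum([sum(values[:2])] + values[2:])  # combiner ran on only one subset

assert no_combiner == fully_combined == partially_combined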
If the reducer only performs a distributive function, such as maximum, minimum, and summation (counting), then we can use the reducer itself as the combiner. But many useful functions aren't distributive. We can rewrite some of them, such as averaging, to take advantage of a combiner.
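To see why averaging isn't distributive, suppose one mapper sees the values 1 and 2 for a key and another sees 3, 4, and 5. The true average is (1 + 2 + 3 + 4 + 5) / 5 = 3, but averaging the two partial averages gives (1.5 + 4) / 2 = 2.75, because that calculation ignores how many values went into each partial result. A combiner for averaging therefore has to carry the counts along, which motivates the rewrite discussed next.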
The averaging approach taken by AverageByAttributeMapper.py is to output each key/value pair as is. AverageByAttributeReducer.py counts the number of key/value pairs it receives for a key and sums up their values, so that a single final division computes the average. The main obstacle to using a combiner is the counting operation, as the reducer assumes that the number of key/value pairs it receives is the number of key/value pairs in the input data.
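One way to rewrite averaging for a combiner, sketched here as an illustration rather than as the book's listing, is to make every intermediate value carry a partial sum together with the count it represents. The mapper emits "claims,1" for each record, the combiner folds any subset of those pairs for a key into a single "sum,count" pair, and the reducer adds up whatever pairs reach it before performing one final division. The dual-purpose script below, its command-line flag, and the comma-separated value format are assumptions for illustration; whether a Streaming combiner may be an arbitrary script depends on the Hadoop version, and the same folding logic can alternatively run inside the mapper itself.

#!/usr/bin/env python
# Combiner-friendly averaging sketch for Hadoop Streaming.
# Every value on the wire has the form "sum,count", so folding the values for
# a key is safe no matter how many times, or on which subsets, it happens.
# Input is assumed to be sorted by key, as on the Streaming reduce side;
# the mapper would emit lines of the form "country<TAB>claims,1".
import sys

def emit(key, total, count, final):
    if final:
        print('%s\t%f' % (key, total / count))     # reducer: one division per key
    else:
        print('%s\t%f,%d' % (key, total, count))   # combiner: still a sum,count pair

def fold(stream, final):
    last_key, total, count = None, 0.0, 0
    for line in stream:
        key, value = line.strip().split('\t')
        partial_sum, partial_count = value.split(',')
        if last_key is not None and key != last_key:
            emit(last_key, total, count, final)
            total, count = 0.0, 0
        last_key = key
        total += float(partial_sum)
        count += int(partial_count)
    if last_key is not None:
        emit(last_key, total, count, final)

if __name__ == '__main__':
    # Pass "partial" to run as the combiner, anything else to run as the reducer.
    fold(sys.stdin, final=(len(sys.argv) < 2 or sys.argv[1] != 'partial'))

Because the reducer now sums the transmitted counts instead of counting the records it receives, it no longer matters how many times the pairs were folded on the way.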
 