We've seen how using the Aggregate package under Streaming is a simple way to
get some popular metrics. It's a great demonstration of Hadoop's power in simplifying
the analysis of large data sets.
4.6 Improving performance with combiners
We saw in AverageByAttributeMapper.py and AverageByAttributeReducer.py (listings 4.7 and 4.8) how to compute the average for each attribute. The mapper reads each record and outputs a key/value pair made up of the record's attribute and the value being averaged. The framework shuffles the key/value pairs across the network, and the reducer computes the average for each key. In our example of computing the average number of claims for each country's patents, we see at least two efficiency bottlenecks:
1. If we have 1 billion input records, the mappers will generate 1 billion key/value pairs that will be shuffled across the network. If we were computing a function such as maximum, it's obvious that the mapper only has to output the maximum for each key it has seen. Doing so would reduce network traffic and increase performance. For a function such as average, it's a bit more complicated, but we can still redefine the algorithm such that each mapper shuffles only one record per key.
2. Using country from the patent data set as key illustrates data skew. The data is far from uniformly distributed, as a significant majority of the records would have U.S. as the key. Not only does every key/value pair in the input map to a key/value pair in the intermediate data, but most of the intermediate key/value pairs will end up at a single reducer, overwhelming it.
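To make the starting point concrete, here is a minimal sketch of what a Streaming mapper and reducer of this shape could look like. It is written in the spirit of listings 4.7 and 4.8 rather than reproducing them; the comma-separated input format and the column positions of the country code and claim count are assumptions for illustration.

#!/usr/bin/env python
# Mapper sketch: emit one (country, claim count) pair per patent record.
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    country, claims = fields[4], fields[8]   # assumed column positions
    if claims.isdigit():                     # skip records without a claim count
        print('%s\t%s' % (country, claims))

#!/usr/bin/env python
# Reducer sketch: Streaming delivers lines sorted by key, so total and count
# the values for one key until the key changes, then emit that key's average.
import sys

last_key, total, count = None, 0.0, 0
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if last_key is not None and key != last_key:
        print('%s\t%f' % (last_key, total / count))
        total, count = 0.0, 0
    last_key = key
    total += float(value)
    count += 1
if last_key is not None:
    print('%s\t%f' % (last_key, total / count))

With one output pair per input record and the full fan-in landing on the reducers, both bottlenecks above are visible directly in this structure.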
Hadoop solves these bottlenecks by extending the MapReduce framework with a combiner step in between the mapper and reducer. You can think of the combiner as a helper
for the reducer. It's supposed to whittle down the output of the mapper to lessen the
load on the network and on the reducer. If we specify a combiner, the MapReduce
framework may apply it zero, one, or more times to the intermediate data. In order for
a combiner to work, it must be an equivalent transformation of the data with respect
to the reducer: that is, if we take out the combiner, the reducer's final output must remain the same.
Furthermore, the equivalent transformation property must hold when the combiner is
applied to arbitrary subsets of the intermediate data.
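The "zero, one, or more times" guarantee is easiest to picture with an operation such as summation, where pre-aggregating any subset of a key's values cannot change the final result. A toy check in plain Python (no Hadoop involved; the variable names are only for illustration):

values = [3, 1, 4, 1, 5, 9]              # all intermediate values for one key

no_combiner = sum(values)                                 # reducer sees the raw values
fully_combined = sum([sum(values[:2]), sum(values[2:])])  # combiner pre-summed two subsets
partially_combined = sum([sum(values[:2])] + values[2:])  # combiner ran on only one subset

assert no_combiner == fully_combined == partially_combined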
If the reducer only performs a distributive function, such as maximum, minimum, and summation (counting), then we can use the reducer itself as the combiner. But many useful functions aren't distributive. We can rewrite some of them, such as averaging, to take advantage of a combiner.
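To see why averaging isn't distributive, suppose one mapper sees the values 1 and 2 for a key and another sees 3, 4, and 5. The true average is (1 + 2 + 3 + 4 + 5) / 5 = 3, but averaging the two partial averages gives (1.5 + 4) / 2 = 2.75, because that calculation ignores how many values went into each partial result. A combiner for averaging therefore has to carry the counts along, which motivates the rewrite discussed next.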
The averaging approach taken by AverageByAttributeMapper.py is to output each key/value pair as is. AverageByAttributeReducer.py counts the number of key/value pairs it receives for a key and sums up their values, so that a single final division computes the average. The main obstacle to using a combiner is the counting operation, as the reducer assumes that the number of key/value pairs it receives is the number of key/value pairs in the input data.
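One way to rewrite averaging for a combiner, sketched here as an illustration rather than as the book's listing, is to make every intermediate value carry a partial sum together with the count it represents. The mapper emits "claims,1" for each record, the combiner folds any subset of those pairs for a key into a single "sum,count" pair, and the reducer adds up whatever pairs reach it before performing one final division. The dual-purpose script below, its command-line flag, and the comma-separated value format are assumptions for illustration; whether a Streaming combiner may be an arbitrary script depends on the Hadoop version, and the same folding logic can alternatively run inside the mapper itself.

#!/usr/bin/env python
# Combiner-friendly averaging sketch for Hadoop Streaming.
# Every value on the wire has the form "sum,count", so folding the values for
# a key is safe no matter how many times, or on which subsets, it happens.
# Input is assumed to be sorted by key, as on the Streaming reduce side;
# the mapper would emit lines of the form "country<TAB>claims,1".
import sys

def emit(key, total, count, final):
    if final:
        print('%s\t%f' % (key, total / count))     # reducer: one division per key
    else:
        print('%s\t%f,%d' % (key, total, count))   # combiner: still a sum,count pair

def fold(stream, final):
    last_key, total, count = None, 0.0, 0
    for line in stream:
        key, value = line.strip().split('\t')
        partial_sum, partial_count = value.split(',')
        if last_key is not None and key != last_key:
            emit(last_key, total, count, final)
            total, count = 0.0, 0
        last_key = key
        total += float(partial_sum)
        count += int(partial_count)
    if last_key is not None:
        emit(last_key, total, count, final)

if __name__ == '__main__':
    # Pass "partial" to run as the combiner, anything else to run as the reducer.
    fold(sys.stdin, final=(len(sys.argv) < 2 or sys.argv[1] != 'partial'))

Because the reducer now sums the transmitted counts instead of counting the records it receives, it no longer matters how many times the pairs were folded on the way.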
 