MapReduce Features - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

% hadoop jar hadoop-examples.jar

SortByTemperatureUsingTotalOrderPartitioner \

-D mapreduce.job.reduces=30 input/ncdc/all-seq output-totalsort

The program produces 30 output partitions, each of which is internally sorted; in addition,

for these partitions, all the keys in partition i are less than the keys in partition i + 1.

Secondary Sort

The MapReduce framework sorts the records by key before they reach the reducers. For

any particular key, however, the values are not sorted. The order in which the values ap-

pear is not even stable from one run to the next, because they come from different map

tasks, which may finish at different times from run to run. Generally speaking, most

MapReduce programs are written so as not to depend on the order in which the values ap-

pear to the reduce function. However, it is possible to impose an order on the values by

sorting and grouping the keys in a particular way.

To illustrate the idea, consider the MapReduce program for calculating the maximum tem-

perature for each year. If we arranged for the values (temperatures) to be sorted in des-

cending order, we wouldn't have to iterate through them to find the maximum; instead, we

could take the first for each year and ignore the rest. (This approach isn't the most effi-

cient way to solve this particular problem, but it illustrates how secondary sort works in

general.)

To achieve this, we change our keys to be composite: a combination of year and temperat-

ure. We want the sort order for keys to be by year (ascending) and then by temperature

(descending):

1900 35°C

1900 34°C

...

1901 36°C

1901 35°C

If all we did was change the key, this wouldn't help, because then records for the same

year would have different keys and therefore would not (in general) go to the same redu-

cer. For example, (1900, 35°C) and (1900, 34°C) could go to different reducers. By setting

a partitioner to partition by the year part of the key, we can guarantee that records for the

same year go to the same reducer. This still isn't enough to achieve our goal, however. A

partitioner ensures only that one reducer receives all the records for a year; it doesn't

change the fact that the reducer groups by key within the partition:

Search WWH ::

Custom Search

Home