Database Reference
In-Depth Information
% hadoop jar hadoop-examples.jar
SortByTemperatureUsingTotalOrderPartitioner \
-D mapreduce.job.reduces=30 input/ncdc/all-seq output-totalsort
The program produces 30 output partitions, each of which is internally sorted; in addition,
for these partitions, all the keys in partition i are less than the keys in partition i + 1.
Secondary Sort
The MapReduce framework sorts the records by key before they reach the reducers. For
any particular key, however, the values are not sorted. The order in which the values ap-
pear is not even stable from one run to the next, because they come from different map
tasks, which may finish at different times from run to run. Generally speaking, most
MapReduce programs are written so as not to depend on the order in which the values ap-
pear to the reduce function. However, it is possible to impose an order on the values by
sorting and grouping the keys in a particular way.
To illustrate the idea, consider the MapReduce program for calculating the maximum tem-
perature for each year. If we arranged for the values (temperatures) to be sorted in des-
cending order, we wouldn't have to iterate through them to find the maximum; instead, we
could take the first for each year and ignore the rest. (This approach isn't the most effi-
cient way to solve this particular problem, but it illustrates how secondary sort works in
general.)
To achieve this, we change our keys to be composite: a combination of year and temperat-
ure. We want the sort order for keys to be by year (ascending) and then by temperature
(descending):
1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C
If all we did was change the key, this wouldn't help, because then records for the same
year would have different keys and therefore would not (in general) go to the same redu-
cer. For example, (1900, 35°C) and (1900, 34°C) could go to different reducers. By setting
a partitioner to partition by the year part of the key, we can guarantee that records for the
same year go to the same reducer. This still isn't enough to achieve our goal, however. A
partitioner ensures only that one reducer receives all the records for a year; it doesn't
change the fact that the reducer groups by key within the partition:
Search WWH ::




Custom Search