Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data. In this section, we examine different ways of sorting datasets and how you can control the sort order in MapReduce. Sorting Avro data is covered separately.
Preparation
We are going to sort the weather dataset by temperature. Storing temperatures as Text objects doesn't work for sorting purposes, because signed integers don't sort lexicographically. Instead, we store the data in sequence files whose IntWritable keys represent the temperatures (and sort correctly) and whose Text values are the lines of data.
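To see why text keys are a poor fit, the following minimal sketch sorts the same readings twice, once as strings and once as parsed integers. It uses only the Java standard library rather than Hadoop's Text and IntWritable types; the class name, method names, and sample readings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TemperatureSortDemo {

    // Sorts the readings as strings, comparing them character by character.
    static List<String> sortAsText(List<String> readings) {
        List<String> sorted = new ArrayList<>(readings);
        Collections.sort(sorted);
        return sorted;
    }

    // Parses each reading as a signed integer and sorts numerically.
    static List<Integer> sortAsIntegers(List<String> readings) {
        List<Integer> sorted = new ArrayList<>();
        for (String reading : readings) {
            sorted.add(Integer.parseInt(reading));
        }
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Illustrative sample readings, not taken from the weather dataset.
        List<String> readings = Arrays.asList("-2", "22", "-1", "100", "3");
        System.out.println("As text:     " + sortAsText(readings));
        System.out.println("As integers: " + sortAsIntegers(readings));
    }
}
```

Sorted as text, the readings come out as -1, -2, 100, 22, 3: the negative values are in the wrong relative order and 100 precedes 22. Sorted as integers, they come out as -2, -1, 3, 22, 100.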
The MapReduce job in Example 9-3 is a map-only job that also filters the input to remove records that don't have a valid temperature reading. Each map creates a single block-compressed sequence file as output. It is invoked with the following command:
% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
  input/ncdc/all-seq
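The filter-and-rekey step that this job performs can be sketched without Hadoop. The version below is only an illustration under stated assumptions: it uses a toy "station,temperature" line format and a sentinel value of 9999 for a missing reading, neither of which is the real record layout, and the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class CleanerSketch {

    // Assumed sentinel for a missing reading in this sketch's toy format.
    static final int MISSING = 9999;

    // Mimics a map-only pass: drops records without a usable temperature
    // and emits (temperature, original line) pairs for the rest.
    static List<Map.Entry<Integer, String>> clean(List<String> lines) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length != 2) {
                continue; // malformed record
            }
            int temperature;
            try {
                temperature = Integer.parseInt(fields[1].trim());
            } catch (NumberFormatException e) {
                continue; // temperature field is not a number
            }
            if (temperature == MISSING) {
                continue; // reading is flagged as missing
            }
            out.add(new SimpleEntry<Integer, String>(temperature, line));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,-12", "b,9999", "c,abc", "d,30");
        for (Map.Entry<Integer, String> pair : clean(lines)) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Because there is no reduce phase, each input record either produces exactly one output pair or is discarded, and the integer key is what a later sort would compare.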
Example 9-3. A MapReduce program for transforming the weather data into SequenceFile format
public class SortDataPreprocessor extends Configured implements Tool {

  static class CleanerMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntWritable(parser.getAirTemperature()), value);
      }
    }
  }