Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't
concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides
to organize its data. In this section, we examine different ways of sorting datasets and how
you can control the sort order in MapReduce. Sorting Avro data is covered separately, in
Sorting Using Avro MapReduce.
Preparation
We are going to sort the weather dataset by temperature. Storing temperatures as Text
objects doesn't work for sorting purposes, because signed integers don't sort
lexicographically. [61] Instead, we are going to store the data using sequence files whose
IntWritable keys represent the temperatures (and sort correctly) and whose Text values are
the lines of data.
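To make the ordering problem concrete, the following minimal sketch sorts the same handful of temperatures twice: once as strings and once as integers. It uses plain java.lang.String as a stand-in for Text (an assumption; for ASCII digits and the minus sign, comparing characters gives the same order as comparing bytes). The class name, method names, and sample values are all hypothetical and are not part of the book's example code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of the weather example code.
public class TemperatureSortDemo {

    // Converts each temperature to its decimal string form and sorts the
    // strings character by character (lexicographic order).
    public static List<String> sortAsText(List<Integer> temps) {
        List<String> asText = new ArrayList<>();
        for (int t : temps) {
            asText.add(Integer.toString(t));
        }
        Collections.sort(asText);
        return asText;
    }

    // Sorts a copy of the temperatures by numeric value.
    public static List<Integer> sortAsIntegers(List<Integer> temps) {
        List<Integer> copy = new ArrayList<>(temps);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Made-up sample readings.
        List<Integer> temps = Arrays.asList(22, -11, 111, -200, 9);
        System.out.println(sortAsText(temps));     // [-11, -200, 111, 22, 9]
        System.out.println(sortAsIntegers(temps)); // [-200, -11, 9, 22, 111]
    }
}
```

In the string ordering, -11 lands before -200 and 111 before 9, which is why the job below uses integer keys rather than text keys.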
The MapReduce job in Example 9-3 is a map-only job that also filters the input to remove
records that don't have a valid temperature reading. Each map creates a single
block-compressed sequence file as output. It is invoked with the following command:
% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
input/ncdc/all-seq
Example 9-3. A MapReduce program for transforming the weather data into SequenceFile
format
public class SortDataPreprocessor extends Configured implements Tool {

  static class CleanerMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      // Drop records that have no valid temperature reading.
      if (parser.isValidTemperature()) {
        // Key: the temperature; value: the original line of data.
        context.write(new IntWritable(parser.getAirTemperature()), value);
      }
    }
  }