MapReduce Features - Hadoop: The Definitive Guide


Sorting

The ability to sort data is at the heart of MapReduce. Even if your application isn't concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data. In this section, we examine different ways of sorting datasets and how you can control the sort order in MapReduce. Sorting Avro data is covered separately, in Sorting Using Avro MapReduce.

Preparation

We are going to sort the weather dataset by temperature. Storing temperatures as Text objects doesn't work for sorting purposes, because signed integers don't sort lexicographically. [61] Instead, we are going to store the data using sequence files whose IntWritable keys represent the temperatures (and sort correctly) and whose Text values are the lines of data.
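To see why text keys sort badly, the following minimal sketch compares the two orderings on a handful of made-up readings. It uses plain Java strings and integers rather than Hadoop's Text and IntWritable, on the assumption that for ASCII digits and the minus sign a string comparison orders values the same way a byte-wise text comparison would. The class and method names are hypothetical and are not part of the book's example code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; stands in for Text vs. IntWritable key ordering.
public class TemperatureSortDemo {

  // Sorts readings character by character, as text keys would be ordered.
  public static List<String> sortAsText(List<String> readings) {
    List<String> sorted = new ArrayList<>(readings);
    Collections.sort(sorted);
    return sorted;
  }

  // Parses each reading to an int first, then sorts numerically.
  public static List<Integer> sortAsInts(List<String> readings) {
    List<Integer> sorted = new ArrayList<>();
    for (String reading : readings) {
      sorted.add(Integer.parseInt(reading));
    }
    Collections.sort(sorted);
    return sorted;
  }

  public static void main(String[] args) {
    // Made-up sample readings, not taken from the weather dataset.
    List<String> readings = Arrays.asList("3", "-20", "10", "-1");
    System.out.println(sortAsText(readings)); // [-1, -20, 10, 3]
    System.out.println(sortAsInts(readings)); // [-20, -1, 3, 10]
  }
}
```

In the text ordering, -1 lands before -20 and 10 before 3, because the comparison stops at the first differing character. Integer keys avoid both problems.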

The MapReduce job in Example 9-3 is a map-only job that also filters the input to remove records that don't have a valid temperature reading. Each map creates a single block-compressed sequence file as output. It is invoked with the following command:

% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
    input/ncdc/all-seq

Example 9-3. A MapReduce program for transforming the weather data into SequenceFile format

public class SortDataPreprocessor extends Configured implements Tool {

  static class CleanerMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntWritable(parser.getAirTemperature()), value);
      }
