An example: Partitioning data
Consider the problem of partitioning the weather dataset by weather station. We would like to run a job whose output is one file per station, with each file containing all the records for that station.
One way of doing this is to have a reducer for each weather station. To arrange this, we
need to do two things. First, write a partitioner that puts records from the same weather
station into the same partition. Second, set the number of reducers on the job to be the
number of weather stations. The partitioner would look like this:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StationPartitioner extends Partitioner<LongWritable, Text> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value);
    return getPartition(parser.getStationId());
  }

  private int getPartition(String stationId) {
    ...
  }
}
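In the driver, the two arrangements described above correspond to two settings on the job. A minimal sketch, in which numStations stands in for however the number of weather stations is obtained (that part is not specified here):

Job job = Job.getInstance(new Configuration(), "Partition records by station");
job.setPartitionerClass(StationPartitioner.class);
job.setNumReduceTasks(numStations); // one reducer per weather station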
The getPartition(String) method, whose implementation is not shown, turns the
station ID into a partition index. To do this, it needs a list of all the station IDs; it then just
returns the index of the station ID in the list.
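For illustration only, a lookup along those lines might be written as follows, assuming the list of station IDs has already been loaded into the partitioner (neither the field nor the loading code appears in the original):

// Hypothetical sketch: the partition index is the station ID's position
// in a pre-loaded java.util.List of all station IDs.
private List<String> stationIds; // assumed to be populated when the partitioner is set up

private int getPartition(String stationId) {
  return stationIds.indexOf(stationId);
}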
There are two drawbacks to this approach. The first is that since the number of partitions needs to be known before the job is run, so does the number of weather stations. Although the NCDC provides metadata about its stations, there is no guarantee that the IDs encountered in the data will match those in the metadata. A station that appears in the metadata but not in the data wastes a reduce task. Worse, a station that appears in the data but not in the metadata doesn't get a reduce task; it has to be thrown away. One way of mitigating this problem would be to write a job to extract the unique station IDs, but it's a shame that we need an extra job to do this.
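As a rough sketch of what such a preparatory job might contain, the mapper below emits each record's station ID as a key and the reducer writes each distinct ID once; the reuse of NcdcRecordParser here is an assumption, and none of this code appears in the original:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits each record's station ID as the key, with no value.
class StationIdMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    context.write(new Text(parser.getStationId()), NullWritable.get());
  }
}

// Each distinct station ID forms one reduce group, so writing the key
// once per call produces the set of unique station IDs.
class StationIdReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(key, NullWritable.get());
  }
}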
The second drawback is more subtle. It is generally a bad idea to allow the number of partitions to be rigidly fixed by the application, since this can lead to small or uneven-sized partitions. Having many reducers doing a small amount of work isn't an efficient way of organizing the job; it's much better to have fewer reducers each doing more work, since the per-task overhead is then reduced.