An example: Partitioning data
Consider the problem of partitioning the weather dataset by weather station. We would like to run a job whose output is one file per station, with each file containing all the records for that station.
One way of doing this is to have a reducer for each weather station. To arrange this, we
need to do two things. First, write a partitioner that puts records from the same weather
station into the same partition. Second, set the number of reducers on the job to be the
number of weather stations. The partitioner would look like this:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StationPartitioner extends Partitioner<LongWritable, Text> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value);
    return getPartition(parser.getStationId());
  }

  private int getPartition(String stationId) {
    ...
  }
}
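In the driver, the two arrangements described above correspond to two settings on the job. A minimal sketch, in which numStations stands in for however the number of weather stations is obtained (that part is not specified here):

Job job = Job.getInstance(new Configuration(), "Partition records by station");
job.setPartitionerClass(StationPartitioner.class);
job.setNumReduceTasks(numStations); // one reducer per weather station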
The getPartition(String) method, whose implementation is not shown, turns the
station ID into a partition index. To do this, it needs a list of all the station IDs; it then just
returns the index of the station ID in the list.
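For illustration only, a lookup along those lines might be written as follows, assuming the list of station IDs has already been loaded into the partitioner (neither the field nor the loading code appears in the original):

// Hypothetical sketch: the partition index is the station ID's position
// in a pre-loaded java.util.List of all station IDs.
private List<String> stationIds; // assumed to be populated when the partitioner is set up

private int getPartition(String stationId) {
  return stationIds.indexOf(stationId);
}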
There are two drawbacks to this approach. The first is that since the number of partitions needs to be known before the job is run, so does the number of weather stations. Although the NCDC provides metadata about its stations, there is no guarantee that the IDs encountered in the data will match those in the metadata. A station that appears in the metadata but not in the data wastes a reduce task. Worse, a station that appears in the data but not in the metadata doesn't get a reduce task; it has to be thrown away. One way of mitigating this problem would be to write a job to extract the unique station IDs, but it's a shame that we need an extra job to do this.
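As a rough sketch of what such a preparatory job might contain, the mapper below emits each record's station ID as a key and the reducer writes each distinct ID once; the reuse of NcdcRecordParser here is an assumption, and none of this code appears in the original:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits each record's station ID as the key, with no value.
class StationIdMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    context.write(new Text(parser.getStationId()), NullWritable.get());
  }
}

// Each distinct station ID forms one reduce group, so writing the key
// once per call produces the set of unique station IDs.
class StationIdReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(key, NullWritable.get());
  }
}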
The second drawback is more subtle. It is generally a bad idea to allow the number of partitions to be rigidly fixed by the application, since this can lead to small or uneven-sized partitions. Having many reducers doing a small amount of work isn't an efficient way of organizing the job; it's much better to have fewer reducers each doing more work, since the per-task overhead is then reduced.