MapReduce Features - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

public static void main ( String [] args ) throws Exception {

int exitCode = ToolRunner . run (

new SortByTemperatureUsingTotalOrderPartitioner (), args );

System . exit ( exitCode );

}

We use a RandomSampler , which chooses keys with a uniform probability — here, 0.1.

There are also parameters for the maximum number of samples to take and the maximum

number of splits to sample (here, 10,000 and 10, respectively; these settings are the de-

faults when InputSampler is run as an application), and the sampler stops when the

first of these limits is met. Samplers run on the client, making it important to limit the

number of splits that are downloaded so the sampler runs quickly. In practice, the time

taken to run the sampler is a small fraction of the overall job time.

The InputSampler writes a partition file that we need to share with the tasks running

on the cluster by adding it to the distributed cache (see Distributed Cache ) .

On one run, the sampler chose -5.6°C, 13.9°C, and 22.0°C as partition boundaries (for

four partitions), which translates into more even partition sizes than the earlier choice:

Temperature range < -5.6°C [-5.6°C, 13.9°C) [13.9°C, 22.0°C) >= 22.0°C

Proportion of records 29%

24%

23%

24%

Your input data determines the best sampler to use. For example, SplitSampler ,

which samples only the first n records in a split, is not so good for sorted data, [ 64 ] because

it doesn't select keys from throughout the split.

On the other hand, IntervalSampler chooses keys at regular intervals through the

split and makes a better choice for sorted data. RandomSampler is a good general-pur-

pose sampler. If none of these suits your application (and remember that the point of

sampling is to produce partitions that are approximately equal in size), you can write your

own implementation of the Sampler interface.

One of the nice properties of InputSampler and TotalOrderPartitioner is that

you are free to choose the number of partitions — that is, the number of reducers.

However, TotalOrderPartitioner will work only if the partition boundaries are

distinct. One problem with choosing a high number is that you may get collisions if you

have a small key space.

Here's how we run it:

Search WWH ::

Custom Search

Home