Database Reference
In-Depth Information
public static void main ( String [] args ) throws Exception {
int exitCode = ToolRunner . run (
new SortByTemperatureUsingTotalOrderPartitioner (), args );
System . exit ( exitCode );
}
}
We use a RandomSampler , which chooses keys with a uniform probability — here, 0.1.
There are also parameters for the maximum number of samples to take and the maximum
number of splits to sample (here, 10,000 and 10, respectively; these settings are the de-
faults when InputSampler is run as an application), and the sampler stops when the
first of these limits is met. Samplers run on the client, making it important to limit the
number of splits that are downloaded so the sampler runs quickly. In practice, the time
taken to run the sampler is a small fraction of the overall job time.
The InputSampler writes a partition file that we need to share with the tasks running
on the cluster by adding it to the distributed cache (see Distributed Cache ) .
On one run, the sampler chose -5.6°C, 13.9°C, and 22.0°C as partition boundaries (for
four partitions), which translates into more even partition sizes than the earlier choice:
Temperature range < -5.6°C [-5.6°C, 13.9°C) [13.9°C, 22.0°C) >= 22.0°C
Proportion of records 29%
24%
23%
24%
Your input data determines the best sampler to use. For example, SplitSampler ,
which samples only the first n records in a split, is not so good for sorted data, [ 64 ] because
it doesn't select keys from throughout the split.
On the other hand, IntervalSampler chooses keys at regular intervals through the
split and makes a better choice for sorted data. RandomSampler is a good general-pur-
pose sampler. If none of these suits your application (and remember that the point of
sampling is to produce partitions that are approximately equal in size), you can write your
own implementation of the Sampler interface.
One of the nice properties of InputSampler and TotalOrderPartitioner is that
you are free to choose the number of partitions — that is, the number of reducers.
However, TotalOrderPartitioner will work only if the partition boundaries are
distinct. One problem with choosing a high number is that you may get collisions if you
have a small key space.
Here's how we run it:
Search WWH ::




Custom Search