  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}
We use a RandomSampler, which chooses keys with a uniform probability (here, 0.1).
There are also parameters for the maximum number of samples to take and the maximum
number of splits to sample (here, 10,000 and 10, respectively; these settings are the
defaults when InputSampler is run as an application), and the sampler stops when the
first of these limits is met. Samplers run on the client, making it important to limit the
number of splits that are downloaded so the sampler runs quickly. In practice, the time
taken to run the sampler is a small fraction of the overall job time.
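The sampling idea described above can be sketched in plain Java. This is a simplified illustration, not Hadoop's implementation: SamplingSketch, sample, and splitPoints are hypothetical names, splits are modeled as in-memory arrays, and the sketch simply stops at whichever limit is reached first before deriving evenly spaced split points from the sorted samples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: a simplified stand-in for what a random sampler does.
// None of these names come from Hadoop.
class SamplingSketch {

    // Keeps each key with probability freq, stopping as soon as either
    // maxSamples keys are collected or maxSplits splits have been read.
    static List<Double> sample(List<double[]> splits, double freq,
                               int maxSamples, int maxSplits, Random rng) {
        List<Double> samples = new ArrayList<>();
        int splitsRead = 0;
        for (double[] split : splits) {
            if (splitsRead >= maxSplits) {
                break;
            }
            splitsRead++;
            for (double key : split) {
                if (samples.size() >= maxSamples) {
                    return samples;
                }
                if (rng.nextDouble() < freq) {
                    samples.add(key);
                }
            }
        }
        return samples;
    }

    // Picks numPartitions - 1 evenly spaced keys from the sorted samples.
    static double[] splitPoints(List<Double> samples, int numPartitions) {
        if (samples.isEmpty() || numPartitions < 1) {
            throw new IllegalArgumentException("need at least one sample and one partition");
        }
        List<Double> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        double[] points = new double[numPartitions - 1];
        double step = sorted.size() / (double) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted.get((int) (step * i));
        }
        return points;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Four made-up splits of synthetic temperatures (assumed data).
        List<double[]> splits = new ArrayList<>();
        for (int s = 0; s < 4; s++) {
            double[] split = new double[1000];
            for (int i = 0; i < split.length; i++) {
                split[i] = 10.0 + 12.0 * rng.nextGaussian();
            }
            splits.add(split);
        }
        List<Double> samples = sample(splits, 0.1, 10_000, 10, rng);
        for (double p : splitPoints(samples, 4)) {
            System.out.printf("split point: %.1f%n", p);
        }
    }
}
```

Because the split points are quantiles of the sample, each of the resulting ranges should hold roughly the same share of the sampled keys, which is the property the partitioner relies on.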
The InputSampler writes a partition file that we need to share with the tasks running
on the cluster by adding it to the distributed cache (see Distributed Cache).
On one run, the sampler chose -5.6°C, 13.9°C, and 22.0°C as partition boundaries (for
four partitions), which translates into more even partition sizes than the earlier choice:
Temperature range      < -5.6°C   [-5.6°C, 13.9°C)   [13.9°C, 22.0°C)   >= 22.0°C
Proportion of records  29%        24%                23%                24%
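To make the half-open ranges concrete, here is a minimal sketch of how a sorted list of boundaries can map a key to a partition index. BoundaryLookup and partitionFor are hypothetical names used only for illustration; they are not Hadoop classes, and the real partitioner reads its boundaries from the partition file rather than from a hard-coded array.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Hadoop): maps a key to a partition index
// given sorted split points, using the same half-open ranges as the table.
class BoundaryLookup {

    static int partitionFor(double[] splitPoints, double key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        if (pos >= 0) {
            // The key equals a boundary, so it starts the next partition.
            return pos + 1;
        }
        // On a miss, binarySearch returns -(insertionPoint) - 1.
        return -(pos + 1);
    }

    public static void main(String[] args) {
        // Boundaries from the sample run described in the text.
        double[] splitPoints = {-5.6, 13.9, 22.0};
        for (double t : new double[] {-10.0, -5.6, 0.0, 13.9, 30.0}) {
            System.out.println(t + " -> partition " + partitionFor(splitPoints, t));
        }
    }
}
```

Three boundaries yield four partitions, numbered 0 to 3, and a key that equals a boundary falls into the partition to its right.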
Your input data determines the best sampler to use. For example, SplitSampler, which
samples only the first n records in a split, is a poor choice for sorted data because
it doesn't select keys from throughout the split.
On the other hand, IntervalSampler chooses keys at regular intervals through the
split and makes a better choice for sorted data. RandomSampler is a good general-purpose
sampler. If none of these suits your application (and remember that the point of
sampling is to produce partitions that are approximately equal in size), you can write your
own implementation of the Sampler interface.
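The difference on sorted input can be seen in a small sketch. It assumes the two strategies reduce to "take the first n keys" (the SplitSampler idea) and "take every k-th key" (a simplification of what IntervalSampler does); SamplerComparison, firstN, and everyNth are hypothetical names and do not appear in Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative comparison of two sampling strategies on one sorted split.
// These methods are simplified stand-ins, not Hadoop's sampler classes.
class SamplerComparison {

    // Takes the first n keys of the split.
    static List<Integer> firstN(int[] split, int n) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < split.length && i < n; i++) {
            out.add(split[i]);
        }
        return out;
    }

    // Takes every interval-th key, starting with the first.
    static List<Integer> everyNth(int[] split, int interval) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < split.length; i += interval) {
            out.add(split[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // A sorted split holding the keys 0..99.
        int[] split = new int[100];
        for (int i = 0; i < split.length; i++) {
            split[i] = i;
        }
        System.out.println("first 10:   " + firstN(split, 10));
        System.out.println("every 10th: " + everyNth(split, 10));
    }
}
```

With sorted keys, the first-n sample covers only the low end of the key range, so any boundaries derived from it would leave most records in the last partition. The interval sample spans the whole range.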
One of the nice properties of InputSampler and TotalOrderPartitioner is that
you are free to choose the number of partitions (that is, the number of reducers).
However, TotalOrderPartitioner will work only if the partition boundaries are
distinct. One problem with choosing a high number is that you may get collisions if you
have a small key space.
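A short sketch shows how such collisions arise. It assumes split points are picked at evenly spaced positions in the sorted sample, which is a simplification; BoundaryCollisionCheck and its methods are hypothetical names, not Hadoop APIs.

```java
import java.util.Arrays;

// Illustrative check for colliding partition boundaries (not Hadoop code).
class BoundaryCollisionCheck {

    // Picks numPartitions - 1 evenly spaced keys from an already sorted sample.
    static int[] splitPoints(int[] sortedSamples, int numPartitions) {
        int[] points = new int[numPartitions - 1];
        double step = sortedSamples.length / (double) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sortedSamples[(int) (step * i)];
        }
        return points;
    }

    // Boundaries are usable only if they are strictly increasing.
    static boolean areDistinct(int[] points) {
        for (int i = 1; i < points.length; i++) {
            if (points[i] <= points[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A small key space: only four distinct keys in the sample.
        int[] samples = {0, 0, 1, 1, 2, 2, 3, 3};
        for (int partitions : new int[] {4, 8}) {
            int[] points = splitPoints(samples, partitions);
            System.out.println(partitions + " partitions: " + Arrays.toString(points)
                + " distinct=" + areDistinct(points));
        }
    }
}
```

With only four distinct keys in the sample, four partitions still yield distinct boundaries, but asking for eight forces the same key to be chosen more than once.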
Here's how we run it: