Database Reference
In-Depth Information
organizing a job; it's much better to get reducers to do more work and have fewer of them,
as the overhead in running a task is then reduced. Uneven-sized partitions can be difficult
to avoid, too. Different weather stations will have gathered a widely varying amount of
data; for example, compare a station that opened one year ago to one that has been gather-
ing data for a century. If a few reduce tasks take significantly longer than the others, they
will dominate the job execution time and cause it to be longer than it needs to be.
NOTE
There are two special cases when it does make sense to allow the application to set the number of parti-
tions (or equivalently, the number of reducers):
Zero reducers
This is a vacuous case: there are no partitions, as the application needs to run only map tasks.
One reducer
It can be convenient to run small jobs to combine the output of previous jobs into a single file. This
should be attempted only when the amount of data is small enough to be processed comfortably by
one reducer.
It is much better to let the cluster drive the number of partitions for a job, the idea being
that the more cluster resources there are available, the faster the job can complete. This is
why the default HashPartitioner works so well: it works with any number of parti-
tions and ensures each partition has a good mix of keys, leading to more evenly sized par-
titions.
If we go back to using HashPartitioner , each partition will contain multiple sta-
tions, so to create a file per station, we need to arrange for each reducer to write multiple
files. This is where MultipleOutputs comes in.
MultipleOutputs
MultipleOutputs allows you to write data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string. This allows each reducer (or
mapper in a map-only job) to create more than a single file. Filenames are of the form
name -m- nnnnn for map outputs and name -r- nnnnn for reduce outputs, where name is
an arbitrary name that is set by the program and nnnnn is an integer designating the part
number, starting from 00000 . The part number ensures that outputs written from different
partitions (mappers or reducers) do not collide in the case of the same name.
Search WWH ::




Custom Search