MapReduce Types and Formats - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

organizing a job; it's much better to get reducers to do more work and have fewer of them,

as the overhead in running a task is then reduced. Uneven-sized partitions can be difficult

to avoid, too. Different weather stations will have gathered a widely varying amount of

data; for example, compare a station that opened one year ago to one that has been gather-

ing data for a century. If a few reduce tasks take significantly longer than the others, they

will dominate the job execution time and cause it to be longer than it needs to be.

NOTE

There are two special cases when it does make sense to allow the application to set the number of parti-

tions (or equivalently, the number of reducers):

Zero reducers

This is a vacuous case: there are no partitions, as the application needs to run only map tasks.

One reducer

It can be convenient to run small jobs to combine the output of previous jobs into a single file. This

should be attempted only when the amount of data is small enough to be processed comfortably by

one reducer.

It is much better to let the cluster drive the number of partitions for a job, the idea being

that the more cluster resources there are available, the faster the job can complete. This is

why the default HashPartitioner works so well: it works with any number of parti-

tions and ensures each partition has a good mix of keys, leading to more evenly sized par-

titions.

If we go back to using HashPartitioner , each partition will contain multiple sta-

tions, so to create a file per station, we need to arrange for each reducer to write multiple

files. This is where MultipleOutputs comes in.

MultipleOutputs

MultipleOutputs allows you to write data to files whose names are derived from the

output keys and values, or in fact from an arbitrary string. This allows each reducer (or

mapper in a map-only job) to create more than a single file. Filenames are of the form

name -m- nnnnn for map outputs and name -r- nnnnn for reduce outputs, where name is

an arbitrary name that is set by the program and nnnnn is an integer designating the part

number, starting from 00000 . The part number ensures that outputs written from different

partitions (mappers or reducers) do not collide in the case of the same name.

Search WWH ::

Custom Search

Home