Tuning a Job
After a job is working, the question many developers ask is, “Can I make it run faster?”
There are a few Hadoop-specific “usual suspects” that are worth checking to see whether
they are responsible for a performance problem. You should run through the checklist in
Table 6-3 before you start trying to profile or optimize at the task level.
Table 6-3. Tuning checklist
| Area | Best practice | Further information |
|---|---|---|
| Number of mappers | How long are your mappers running for? If they are only running for a few seconds on average, you should see whether there's a way to have fewer mappers and make them all run longer (a minute or so, as a rule of thumb). The extent to which this is possible depends on the input format you are using. | Small files and CombineFileInputFormat |
| Number of reducers | Check that you are using more than a single reducer. Reduce tasks should run for five minutes or so and produce at least a block's worth of data, as a rule of thumb. | Choosing the Number of Reducers |
| Combiners | Check whether your job can take advantage of a combiner to reduce the amount of data passing through the shuffle. | Combiner Functions |
| Intermediate compression | Job execution time can almost always benefit from enabling map output compression. | Compressing map output |
| Custom serialization | If you are using your own custom Writable objects or custom comparators, make sure you have implemented RawComparator. | Implementing a RawComparator for speed |
| Shuffle tweaks | The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you wring out the last bit of performance. | Configuration Tuning |
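The RawComparator item in the checklist rests on one idea: keys can be compared in their serialized form, without first deserializing them into objects. The following is a minimal sketch of that idea in plain Java, not Hadoop's own implementation. The class and method names (RawIntCompareSketch, serialize, readInt, compareRaw) are hypothetical, and the 4-byte big-endian key layout is an assumption chosen to match what DataOutput.writeInt produces.

```java
import java.nio.ByteBuffer;

class RawIntCompareSketch {

    // Serialize an int as 4 big-endian bytes (ByteBuffer's default order,
    // the same layout DataOutput.writeInt produces).
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Read a big-endian int straight out of the buffer at the given offset.
    // No key object is created.
    static int readInt(byte[] buf, int offset) {
        return ((buf[offset] & 0xff) << 24)
             | ((buf[offset + 1] & 0xff) << 16)
             | ((buf[offset + 2] & 0xff) << 8)
             |  (buf[offset + 3] & 0xff);
    }

    // Hypothetical helper: compare two serialized int keys in place.
    // Each key is identified by a buffer and a start offset.
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] a = serialize(-5);
        byte[] b = serialize(163);
        // A negative result means the first key sorts before the second.
        System.out.println(compareRaw(a, 0, b, 0) < 0);
    }
}
```

The saving comes from skipping object creation for every comparison made during the sort. How much that matters depends on the key type and the volume of map output.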
Profiling Tasks
Like debugging, profiling a job running on a distributed system such as MapReduce
presents some challenges. Hadoop allows you to profile a fraction of the tasks in a job and,
as each task completes, pulls down the profile information to your machine for later analysis
with standard profiling tools.
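Profiling a fraction of the tasks is typically switched on through job configuration properties. The sketch below only assembles "-D name=value" launch options as strings; it does not call any Hadoop API. The class and method names (ProfileOptionsSketch, profileArgs) are hypothetical. The property names are the ones used by Hadoop 2 and later as far as I know, and should be treated as an assumption to check against the mapred-default.xml of your Hadoop version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ProfileOptionsSketch {

    // Build "-D name=value" options that enable profiling for a range of
    // map task IDs and a range of reduce task IDs (for example "0-2").
    static List<String> profileArgs(String mapTaskRange, String reduceTaskRange) {
        // Property names are an assumption (Hadoop 2+ naming); verify them
        // against your version's mapred-default.xml.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.task.profile", "true");
        props.put("mapreduce.task.profile.maps", mapTaskRange);
        props.put("mapreduce.task.profile.reduces", reduceTaskRange);

        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("-D");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // Prints the options on one line, ready to paste into a launch command.
        System.out.println(String.join(" ", profileArgs("0-2", "0-2")));
    }
}
```

Keeping the task ranges small limits profiling overhead to a few tasks, in line with profiling only a fraction of the job.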
Of course, it's possible, and somewhat easier, to profile a job running in the local job runner.
Provided you can run with enough input data to exercise the map and reduce
tasks, this can be a valuable way of improving the performance of your mappers and reducers.