cers. There are a couple of caveats, however. The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should compare the new execution time with the old one running on a real cluster. Even this is easier said than done, since job execution times can vary due to resource contention with other jobs and the decisions the scheduler makes regarding task placement. To get a good idea of job execution time under these circumstances, perform a series of runs (with and without the change) and check whether any improvement is statistically significant.
It's unfortunately true that some problems (such as excessive memory use) can be reproduced only on the cluster, and in these cases the ability to profile in situ is indispensable.
The HPROF profiler
There are a number of configuration properties to control profiling, which are also exposed via convenience methods on JobConf. Enabling profiling is as simple as setting the property mapreduce.task.profile to true:
% hadoop jar hadoop-examples.jar v4.MaxTemperatureDriver \
-conf conf/hadoop-cluster.xml \
-D mapreduce.task.profile=true \
input/ncdc/all max-temp
This runs the job as normal, but adds an -agentlib parameter to the Java command used to launch the task containers on the node managers. You can control the precise parameter that is added by setting the mapreduce.task.profile.params property.
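As a sketch of what such an override looks like, the property can be set in a configuration file; the value shown below is the conventional HPROF invocation (the %s placeholder is replaced by the path of the profile output file), but the exact default may vary between Hadoop releases, so check mapred-default.xml for your version:

% cat conf/profile-site.xml
<property>
  <name>mapreduce.task.profile.params</name>
  <value>-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</value>
</property>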
The default uses HPROF, a profiling tool that comes with the JDK that, although basic,
can give valuable information about a program's CPU and heap usage.
It doesn't usually make sense to profile all tasks in a job, so by default only those with IDs 0, 1, and 2 are profiled (for both maps and reduces). You can change this by setting mapreduce.task.profile.maps and mapreduce.task.profile.reduces to specify the range of task IDs to profile.
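For example, both properties can be passed on the command line in the same way as before; the task ID values here are illustrative (a value such as 0-2 is a range, and comma-separated lists may also be accepted, depending on the Hadoop release):

% hadoop jar hadoop-examples.jar v4.MaxTemperatureDriver \
  -conf conf/hadoop-cluster.xml \
  -D mapreduce.task.profile=true \
  -D mapreduce.task.profile.maps=0-2 \
  -D mapreduce.task.profile.reduces=0 \
  input/ncdc/all max-temp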
The profile output for each task is saved with the task logs in the userlogs subdirectory of the node manager's local log directory (alongside the syslog, stdout, and stderr files), and can be retrieved in the way described in Hadoop Logs, according to whether log aggregation is enabled or not.