cers. There are a couple of caveats, however. The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should compare the new execution time with the old one running on a real cluster. Even this is easier said than done, since job execution times can vary due to resource contention with other jobs and the decisions the scheduler makes regarding task placement. To get a good idea of job execution time under these circumstances, perform a series of runs (with and without the change) and check whether any improvement is statistically significant.
It's unfortunately true that some problems (such as excessive memory use) can be reproduced only on the cluster, and in these cases the ability to profile in situ is indispensable.
The HPROF profiler
There are a number of configuration properties to control profiling, which are also exposed via convenience methods on JobConf. Enabling profiling is as simple as setting the property mapreduce.task.profile to true:
% hadoop jar hadoop-examples.jar v4.MaxTemperatureDriver \
-conf conf/hadoop-cluster.xml \
-D mapreduce.task.profile=true \
input/ncdc/all max-temp
This runs the job as normal, but adds an -agentlib parameter to the Java command used to launch the task containers on the node managers. You can control the precise parameter that is added by setting the mapreduce.task.profile.params property.
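As a sketch of what such an override looks like, the property can be set in a configuration file; the value shown below is the conventional HPROF invocation (the %s placeholder is replaced by the path of the profile output file), but the exact default may vary between Hadoop releases, so check mapred-default.xml for your version:

% cat conf/profile-site.xml
<property>
  <name>mapreduce.task.profile.params</name>
  <value>-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</value>
</property>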
The default uses HPROF, a profiling tool that comes with the JDK that, although basic,
can give valuable information about a program's CPU and heap usage.
It doesn't usually make sense to profile all tasks in a job, so by default only those with IDs 0, 1, and 2 are profiled (for both maps and reduces). You can change this by setting mapreduce.task.profile.maps and mapreduce.task.profile.reduces to specify the range of task IDs to profile.
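For example, both properties can be passed on the command line in the same way as before; the task ID values here are illustrative (a value such as 0-2 is a range, and comma-separated lists may also be accepted, depending on the Hadoop release):

% hadoop jar hadoop-examples.jar v4.MaxTemperatureDriver \
  -conf conf/hadoop-cluster.xml \
  -D mapreduce.task.profile=true \
  -D mapreduce.task.profile.maps=0-2 \
  -D mapreduce.task.profile.reduces=0 \
  input/ncdc/all max-temp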
The profile output for each task is saved with the task logs in the userlogs subdirectory of the node manager's local log directory (alongside the syslog, stdout, and stderr files), and can be retrieved in the way described in Hadoop Logs, according to whether log aggregation is enabled or not.