6.2.3 Rerunning failed tasks with IsolationRunner
Debugging through log files is about reconstructing events using generic historical
records. Sometimes there's not enough information in the logs to trace back the cause
of failure. Hadoop has an IsolationRunner utility that functions like a time machine
for debugging. This utility can isolate and rerun the failed task with the exact same
input on the same node. You can attach a debugger to monitor the task as it runs and
focus on gathering evidence specific to the failure.
To use the IsolationRunner feature, you must run your job with the configuration
property keep.failed.task.files set to true. This tells every TaskTracker to keep
all the data necessary to rerun the failed tasks.
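One way to set this property for a single run is from the command line, assuming your job driver goes through ToolRunner so that generic options such as -D are parsed; the jar name, driver class, and input/output paths here are placeholders:

bin/hadoop jar myjob.jar MyJobDriver -D keep.failed.task.files=true input output

Equivalently, the driver can call setKeepFailedTaskFiles(true) on its JobConf before submitting the job.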
When a job fails, you use the JobTracker Web UI to locate the node, the job ID, and
the task attempt ID of the failed task. You log into the node where the task failed and
go to the work directory under the directory for the task attempt:

local_dir/taskTracker/jobcache/job_id/attempt_id/work

where job_id and attempt_id are the job ID and task attempt ID of the failed task. (The
job ID should start with "job_" and the task attempt ID should start with "attempt_".)
The root directory local_dir is what is set in the configuration property
mapred.local.dir. Note that Hadoop allows a node to use multiple local directories (by setting
mapred.local.dir to a comma-separated list of directories) to spread out disk I/O
among multiple drives. If the node is configured that way, you'll have to look in all the
local directories to find the one with the right attempt_id subdirectory.
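For example, if mapred.local.dir on that node were set to /disk1/mapred/local,/disk2/mapred/local (hypothetical paths used only for illustration), a quick search across both directories would show which one holds the attempt directory you're after:

find /disk1/mapred/local /disk2/mapred/local -path '*jobcache*' -type d -name 'attempt_*'

You can then pick out the entry whose name matches the failed task attempt ID shown in the JobTracker Web UI.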
Within the work directory you can execute IsolationRunner to rerun the failed task
with the same input that it had before. In the rerun, we want the JVM to be enabled
for remote debugging. As we're not running the JVM directly but through the bin/
hadoop script, we specify the JVM debugging options through HADOOP_OPTS:

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,address=8000"

This tells the JVM to listen for the debugger at port 8000 and to wait for the debugger
to attach before running any code.6 We now use IsolationRunner to rerun
the task:
bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
The job.xml file contains all the configuration information IsolationRunner needs.
Given our specification, the JVM will wait for a debugger to attach before executing
the task. You can attach any Java debugger that supports the Java Debug Wire
Protocol (JDWP) to the JVM; all the major Java IDEs do. For example, if you're using jdb,
you can attach it to the JVM via

jdb -attach 8000
6 Options to configure the Sun JVM for debugging are further explained in Sun's documentation: http://java.sun.com/javase/6/docs/technotes/guides/jpda/conninv.html#Invocation.