6.2.3 Rerunning failed tasks with IsolationRunner
Debugging through log files is about reconstructing events using generic historical
records. Sometimes there's not enough information in the logs to trace back the cause
of failure. Hadoop has an IsolationRunner utility that functions like a time machine
for debugging. This utility can isolate and rerun the failed task with the exact same
input on the same node. You can attach a debugger to monitor the task as it runs and
focus on gathering evidence specific to the failure.
To use the IsolationRunner feature, you must run your job with the configuration
property keep.failed.task.files set to true. This tells every TaskTracker to keep
all the data necessary to rerun the failed tasks.
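One way to set this property for a single run is from the command line, assuming your job driver goes through ToolRunner so that generic options such as -D are parsed; the jar name, driver class, and input/output paths here are placeholders:

bin/hadoop jar myjob.jar MyJobDriver -D keep.failed.task.files=true input output

Equivalently, the driver can call setKeepFailedTaskFiles(true) on its JobConf before submitting the job.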
When a job fails, you use the JobTracker Web UI to locate the node, the job ID, and
the task attempt ID of the failed task. You log into the node where the task failed and
go to the work directory under the directory for the task attempt:

local_dir/taskTracker/jobcache/job_id/attempt_id/work

where job_id and attempt_id are the job ID and task attempt ID of the failed task. (The
job ID should start with "job_" and the task attempt ID should start with "attempt_".)
The root directory local_dir is what is set in the configuration property
mapred.local.dir. Note that Hadoop allows a node to use multiple local directories (by setting
mapred.local.dir to a comma-separated list of directories) to spread out disk I/O
among multiple drives. If the node is configured that way, you'll have to look in all the
local directories to find the one with the right attempt_id subdirectory.
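For example, if mapred.local.dir on that node were set to /disk1/mapred/local,/disk2/mapred/local (hypothetical paths used only for illustration), a quick search across both directories would show which one holds the attempt directory you're after:

find /disk1/mapred/local /disk2/mapred/local -path '*jobcache*' -type d -name 'attempt_*'

You can then pick out the entry whose name matches the failed task attempt ID shown in the JobTracker Web UI.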
Within the work directory you can execute IsolationRunner to rerun the failed task
with the same input that it had before. In the rerun, we want the JVM to be enabled
for remote debugging. As we're not running the JVM directly but through the bin/
hadoop script, we specify the JVM debugging options through HADOOP_OPTS:

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,address=8000"

This tells the JVM to listen for the debugger at port 8000 and to wait for the debugger
to attach before running any code.6 We now use IsolationRunner to rerun
the task:
bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
The job.xml file contains all the configuration information IsolationRunner needs.
Given our specification, the JVM will wait for a debugger to attach before executing
the task. You can attach any Java debugger that supports the Java Debug Wire
Protocol (JDWP) to the JVM; all the major Java IDEs do. For example, if you're using jdb,
you can attach it to the JVM via

jdb -attach 8000
6 Options to configure the Sun JVM for debugging are further explained in Sun's documentation: http://java.sun.com/javase/6/docs/technotes/guides/jpda/conninv.html#Invocation.