Failures
In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete successfully. We need to consider the failure of any of the following entities: the task, the application master, the node manager, and the resource manager.
Task Failure
Consider first the case of the task failing. The most common occurrence of this failure is when user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error back to its parent application master before it exits. The error ultimately makes it into the user logs. The application master marks the task attempt as failed, and frees up the container so its resources are available for another task.
For Streaming tasks, if the Streaming process exits with a nonzero exit code, the task is marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the default is true).
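The exit-code check can be sketched as a small wrapper around the child process. This is an illustrative simulation, not Hadoop's actual code: the function name and flag are hypothetical, and only the semantics of stream.non.zero.exit.is.failure come from the text above.

```python
import subprocess
import sys

def run_streaming_attempt(cmd, non_zero_exit_is_failure=True):
    """Run a Streaming child process and report the attempt status.

    Mirrors the stream.non.zero.exit.is.failure semantics: a nonzero
    exit code marks the attempt as failed unless the flag is False.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0 and non_zero_exit_is_failure:
        return "FAILED"
    return "SUCCEEDED"

# A mapper that crashes with a nonzero exit code is marked as failed:
status = run_streaming_attempt([sys.executable, "-c", "import sys; sys.exit(1)"])
```

Setting the flag to False reproduces the effect of turning the property off: the attempt succeeds regardless of the child's exit code.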
Another failure mode is the sudden exit of the task JVM — perhaps there is a JVM bug that
causes the JVM to exit for a particular set of circumstances exposed by the MapReduce
user code. In this case, the node manager notices that the process has exited and informs
the application master so it can mark the attempt as failed.
Hanging tasks are dealt with differently. The application master notices that it hasn't received a progress update for a while and proceeds to mark the task as failed. The task JVM process will be killed automatically after this period.[53] The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis (or a cluster basis) by setting the mapreduce.task.timeout property to a value in milliseconds.
Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In this case, a hanging task will never free up its container, and over time the cluster may slow down as a result. This approach should therefore be avoided; instead, make sure that each task reports progress periodically, which should suffice (see What Constitutes Progress in MapReduce?).
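The timeout rule, including the special case of zero, can be summarized in a few lines. This is an illustrative sketch (the function name is hypothetical), not Hadoop's implementation; the default of 600,000 ms and the zero-disables behavior come from the text above.

```python
def is_task_timed_out(last_progress_ms, now_ms, timeout_ms=600_000):
    """Decide whether a task should be marked as failed for hanging.

    timeout_ms mirrors mapreduce.task.timeout: the default is
    10 minutes (600,000 ms), and a value of 0 disables the check.
    """
    if timeout_ms == 0:
        return False  # long-running tasks are never marked as failed
    return (now_ms - last_progress_ms) > timeout_ms
```

With the default, a task whose last progress report is 11 minutes old would be marked as failed, while a timeout of 0 lets it hang indefinitely and hold on to its container.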
When the application master is notified of a task attempt that has failed, it will reschedule execution of the task. The application master will try to avoid rescheduling the task on a node manager where it has previously failed. Furthermore, if a task fails four times, it will not be retried again. This value is configurable: the maximum number of attempts to run a task is controlled by the mapreduce.map.maxattempts property for map tasks and the mapreduce.reduce.maxattempts property for reduce tasks.
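The retry policy can be sketched as follows. This is an illustrative simulation (the function and variable names are hypothetical); max_attempts stands in for the mapreduce.map.maxattempts default of four, and the node-avoidance behavior comes from the text above.

```python
def schedule_next_attempt(all_nodes, failed_on, attempts_made, max_attempts=4):
    """Pick a node for the next task attempt.

    Prefers nodes where the task has not previously failed, and gives
    up (returns None) once max_attempts is reached, mirroring the
    default of four attempts before the whole task is marked as failed.
    """
    if attempts_made >= max_attempts:
        return None  # the task will not be retried again
    for node in all_nodes:
        if node not in failed_on:
            return node
    # Every node has seen a failure; reuse one rather than stall.
    return all_nodes[0]
```

A second attempt after a failure on node1 would be placed on node2, and a fifth attempt is never scheduled at all.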