mapreduce.reduce.maxattempts for reduce tasks. By default, if any task fails
four times (or whatever the maximum number of attempts is configured to), the whole job
fails.
For some applications, it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks are controlled independently, using the mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent properties.
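The per-type threshold rule can be sketched as plain Java. This is an illustrative model of the documented semantics, not Hadoop code; the class and method names here are made up for the sketch:

```java
// Illustrative model of the failures.maxpercent rule; FailurePolicy and
// jobFails are not part of the Hadoop API, they just mirror the semantics
// described above: the job aborts once the percentage of permanently
// failed tasks of a given type exceeds the configured maximum.
public class FailurePolicy {
    static boolean jobFails(int failedTasks, int totalTasks, int maxPercent) {
        // Percentage of tasks of this type (map or reduce) that failed
        double failedPercent = 100.0 * failedTasks / totalTasks;
        return failedPercent > maxPercent;
    }

    public static void main(String[] args) {
        // With mapreduce.map.failures.maxpercent set to 10, a job with
        // 5 failed maps out of 100 keeps running...
        System.out.println(jobFails(5, 100, 10));   // false: within threshold
        // ...but 11 failed maps out of 100 aborts it.
        System.out.println(jobFails(11, 100, 10));  // true: threshold exceeded
    }
}
```

Note that map and reduce thresholds are checked independently, so a job can tolerate many map failures while still failing on the first reduce failure if only the map property is set.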
A task attempt may also be killed, which is different from failing. A task attempt may be killed because it is a speculative duplicate (for more information on this topic, see Speculative Execution), or because the node manager it was running on failed and the application master marked all the task attempts running on it as killed. Killed task attempts do not count against the number of attempts to run the task (as set by mapreduce.map.maxattempts and mapreduce.reduce.maxattempts), because it wasn't the task's fault that an attempt was killed.
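The distinction between failed and killed attempts can be sketched as a small accounting model. Again this is illustrative only (the names are invented, not Hadoop source): only failed attempts are charged against the configured maximum.

```java
import java.util.List;

// Illustrative model of attempt accounting; not Hadoop source code.
public class AttemptCounter {
    enum Outcome { SUCCEEDED, FAILED, KILLED }

    // A task fails permanently once its FAILED attempts reach
    // mapreduce.{map,reduce}.maxattempts; KILLED attempts (speculative
    // duplicates, attempts lost with a node manager) are not counted.
    static boolean taskFails(List<Outcome> attempts, int maxAttempts) {
        long failed = attempts.stream()
                              .filter(o -> o == Outcome.FAILED)
                              .count();
        return failed >= maxAttempts;
    }

    public static void main(String[] args) {
        // Three failures plus one killed speculative attempt:
        // still one retry left under the default of four attempts.
        System.out.println(taskFails(
            List.of(Outcome.FAILED, Outcome.KILLED,
                    Outcome.FAILED, Outcome.FAILED), 4)); // false
        // A fourth genuine failure exhausts the attempts and fails the task.
        System.out.println(taskFails(
            List.of(Outcome.FAILED, Outcome.FAILED,
                    Outcome.FAILED, Outcome.FAILED), 4)); // true
    }
}
```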
Users may also kill or fail task attempts using the web UI or the command line (type
mapred job to see the options). Jobs may be killed by the same mechanisms.
Application Master Failure
Just as MapReduce tasks are given several attempts to succeed (in the face of hardware or network failures), applications in YARN are retried in the event of failure. The maximum number of attempts to run a MapReduce application master is controlled by the mapreduce.am.max-attempts property. The default value is 2, so if a MapReduce application master fails twice it will not be tried again and the job will fail.
YARN imposes a limit on the maximum number of attempts for any YARN application master running on the cluster, and individual applications may not exceed this limit. The limit is set by yarn.resourcemanager.am.max-attempts and defaults to 2, so if you want to increase the number of MapReduce application master attempts, you will have to increase the YARN setting on the cluster, too.
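The interaction between the two settings can be sketched as follows. This is an illustrative model with an invented method name, not Hadoop source: the cluster-wide YARN limit caps whatever the per-job MapReduce property asks for.

```java
// Illustrative model of how the YARN cluster limit caps the per-job
// MapReduce AM retry setting; effectiveAttempts is invented for this sketch.
public class AmRetries {
    // Effective number of application master attempts: the per-job
    // mapreduce.am.max-attempts value, capped by the cluster-wide
    // yarn.resourcemanager.am.max-attempts limit.
    static int effectiveAttempts(int mrAmMaxAttempts, int yarnAmMaxAttempts) {
        return Math.min(mrAmMaxAttempts, yarnAmMaxAttempts);
    }

    public static void main(String[] args) {
        // Both default to 2, so the AM gets two attempts.
        System.out.println(effectiveAttempts(2, 2)); // 2
        // Raising only the MapReduce property has no effect:
        // the cluster-wide YARN limit still wins.
        System.out.println(effectiveAttempts(4, 2)); // 2
        // To actually get more attempts, both settings must be raised.
        System.out.println(effectiveAttempts(4, 4)); // 4
    }
}
```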
The way recovery works is as follows. An application master sends periodic heartbeats to the resource manager, and in the event of application master failure, the resource manager will detect the failure and start a new instance of the master running in a new container (managed by a node manager). In the case of the MapReduce application master, it will use the job history to recover the state of the tasks that were already run by the (failed) application master, so they do not have to be rerun.