application so they don't have to be rerun. Recovery is enabled by default, but can be disabled by setting yarn.app.mapreduce.am.job.recovery.enable to false.
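As an illustration, here is a minimal sketch of disabling recovery for a single job, assuming the standard org.apache.hadoop.conf.Configuration and Job APIs; the class name and the rest of the job setup are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoRecoveryDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Recovery is on by default; setting this to false means completed
        // tasks are rerun if the application master fails.
        conf.setBoolean("yarn.app.mapreduce.am.job.recovery.enable", false);
        Job job = Job.getInstance(conf, "no-am-recovery");
        // ... set mapper, reducer, input and output paths, then submit ...
      }
    }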
The MapReduce client polls the application master for progress reports, but if its application master fails, the client needs to locate the new instance. During job initialization, the client asks the resource manager for the application master's address, and then caches it so it doesn't overload the resource manager with a request every time it needs to poll the application master. If the application master fails, however, the client will experience a timeout when it issues a status update, at which point the client will go back to the resource manager to ask for the new application master's address. This process is transparent to the user.
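That transparency means ordinary client-side polling code needs no special handling for an application master failure. A rough sketch (the helper name is made up; it assumes a Job that has already been submitted):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Job;

    public class ProgressPoller {
      // Polls a submitted job until it finishes, printing progress.
      public static void waitShowingProgress(Job job)
          throws IOException, InterruptedException {
        while (!job.isComplete()) {
          // These calls reach the application master; if it has failed, the
          // client library times out, asks the resource manager for the new
          // application master's address, and retries without our help.
          System.out.printf("map %.0f%% reduce %.0f%%%n",
              job.mapProgress() * 100, job.reduceProgress() * 100);
          Thread.sleep(5000);
        }
      }
    }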
Node Manager Failure
If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will notice a node manager that has stopped sending heartbeats if it hasn't received one for 10 minutes (this is configured, in milliseconds, via the yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms property) and remove it from its pool of nodes to schedule containers on.
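For reference, 10 minutes is 600,000 milliseconds. A sketch of lowering it follows; this is a resource manager setting that normally lives in yarn-site.xml, so setting it programmatically as below is purely illustrative, and the five-minute value is just an example:

    import org.apache.hadoop.conf.Configuration;

    public class NodeExpirySetting {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Declare a node manager lost after 5 minutes without a heartbeat,
        // instead of the default 600000 ms (10 minutes).
        conf.setLong(
            "yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms",
            5 * 60 * 1000L);
      }
    }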
Any task or application master running on the failed node manager will be recovered using the mechanisms described in the previous two sections. In addition, the application master arranges for map tasks that were run and completed successfully on the failed node manager to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed node manager's local filesystem may not be accessible to the reduce task.
Node managers may be blacklisted if the number of failures for the application is high, even if the node manager itself has not failed. Blacklisting is done by the application master, and for MapReduce the application master will try to reschedule tasks on different nodes if more than three tasks fail on a node manager. The user may set the threshold with the mapreduce.job.maxtaskfailures.per.tracker job property.
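A sketch of raising that per-node failure threshold for a job (the value 4 is only an example; the class name and job name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BlacklistThresholdDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the per-node task failure threshold from 3 to 4 before the
        // application master stops scheduling this job's tasks on that node.
        conf.setInt("mapreduce.job.maxtaskfailures.per.tracker", 4);
        Job job = Job.getInstance(conf, "blacklist-threshold");
        // ... remaining job setup ...
      }
    }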
NOTE
Note that the resource manager does not do blacklisting across applications (at the time of writing), so
tasks from new jobs may be scheduled on bad nodes even if they have been blacklisted by an application
master running an earlier job.