application so they don't have to be rerun. Recovery is enabled by default, but can be disabled by setting yarn.app.mapreduce.am.job.recovery.enable to false.
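As an illustration, here is a minimal sketch of disabling recovery for a single job, assuming the standard org.apache.hadoop.conf.Configuration and Job APIs; the class name and the rest of the job setup are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoRecoveryDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Recovery is on by default; setting this to false means completed
        // tasks are rerun if the application master fails.
        conf.setBoolean("yarn.app.mapreduce.am.job.recovery.enable", false);
        Job job = Job.getInstance(conf, "no-am-recovery");
        // ... set mapper, reducer, input and output paths, then submit ...
      }
    }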
The MapReduce client polls the application master for progress reports, but if its application master fails, the client needs to locate the new instance. During job initialization, the client asks the resource manager for the application master's address, and then caches it so it doesn't overload the resource manager with a request every time it needs to poll the application master. If the application master fails, however, the client will experience a timeout when it issues a status update, at which point the client will go back to the resource manager to ask for the new application master's address. This process is transparent to the user.
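That transparency means ordinary client-side polling code needs no special handling for an application master failure. A rough sketch (the helper name is made up; it assumes a Job that has already been submitted):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Job;

    public class ProgressPoller {
      // Polls a submitted job until it finishes, printing progress.
      public static void waitShowingProgress(Job job)
          throws IOException, InterruptedException {
        while (!job.isComplete()) {
          // These calls reach the application master; if it has failed, the
          // client library times out, asks the resource manager for the new
          // application master's address, and retries without our help.
          System.out.printf("map %.0f%% reduce %.0f%%%n",
              job.mapProgress() * 100, job.reduceProgress() * 100);
          Thread.sleep(5000);
        }
      }
    }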
Node Manager Failure
If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will notice a node manager that has stopped sending heartbeats if it hasn't received one for 10 minutes (this is configured, in milliseconds, via the yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms property) and remove it from its pool of nodes to schedule containers on.
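For reference, 10 minutes is 600,000 milliseconds. A sketch of lowering it follows; this is a resource manager setting that normally lives in yarn-site.xml, so setting it programmatically as below is purely illustrative, and the five-minute value is just an example:

    import org.apache.hadoop.conf.Configuration;

    public class NodeExpirySetting {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Declare a node manager lost after 5 minutes without a heartbeat,
        // instead of the default 600000 ms (10 minutes).
        conf.setLong(
            "yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms",
            5 * 60 * 1000L);
      }
    }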
Any task or application master running on the failed node manager will be recovered using the mechanisms described in the previous two sections. In addition, the application master arranges for map tasks that were run and completed successfully on the failed node manager to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed node manager's local filesystem may not be accessible to the reduce task.
Node managers may be blacklisted if the number of failures for the application is high, even if the node manager itself has not failed. Blacklisting is done by the application master, and for MapReduce the application master will try to reschedule tasks on different nodes if more than three tasks fail on a node manager. The user may set the threshold with the mapreduce.job.maxtaskfailures.per.tracker job property.
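A sketch of raising that per-node failure threshold for a job (the value 4 is only an example; the class name and job name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BlacklistThresholdDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the per-node task failure threshold from 3 to 4 before the
        // application master stops scheduling this job's tasks on that node.
        conf.setInt("mapreduce.job.maxtaskfailures.per.tracker", 4);
        Job job = Job.getInstance(conf, "blacklist-threshold");
        // ... remaining job setup ...
      }
    }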
NOTE
Note that the resource manager does not do blacklisting across applications (at the time of writing), so
tasks from new jobs may be scheduled on bad nodes even if they have been blacklisted by an application
master running an earlier job.