The latter is what the MapReduce YARN application does, which we'll look at in more detail in Anatomy of a MapReduce Job Run.
Notice from Figure 4-2 that YARN itself does not provide any way for the parts of the application (client, master, process) to communicate with one another. Most nontrivial YARN applications use some form of remote communication (such as Hadoop's RPC layer) to pass status updates and results back to the client, but these are specific to the application.
Resource Requests
YARN has a flexible model for making resource requests. A request for a set of containers
can express the amount of computer resources required for each container (memory and
CPU), as well as locality constraints for the containers in that request.
Locality is critical in ensuring that distributed data processing algorithms use the cluster bandwidth efficiently,[36] so YARN allows an application to specify locality constraints for the containers it is requesting. Locality constraints can be used to request a container on a specific node or rack, or anywhere on the cluster (off-rack).
Sometimes the locality constraint cannot be met, in which case either no allocation is made or, optionally, the constraint can be loosened. For example, if a specific node was requested but it is not possible to start a container on it (because other containers are running on it), then YARN will try to start a container on a node in the same rack, or, if that's not possible, on any node in the cluster.
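To make this concrete, here is a minimal sketch of such a request using YARN's AMRMClient API, as an application master might issue it. The memory, core, node, and rack values are hypothetical, and client initialization and registration with the resource manager are elided:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityRequestSketch {
  public static void main(String[] args) {
    // Client setup (init/start and AM registration elided for brevity).
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();

    // Each container in this request needs 1024 MB of memory and 1 vcore.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);

    // Preferred placement: a specific node, then its rack. The final
    // argument (relaxLocality = true) lets the scheduler fall back to
    // the rack, and then to any node, if the preferred node is busy.
    ContainerRequest request = new ContainerRequest(
        capability,
        new String[] {"node1.example.com"}, // hypothetical node
        new String[] {"/rack1"},            // hypothetical rack
        priority,
        true);                              // allow locality relaxation

    amClient.addContainerRequest(request);
  }
}

Passing false for the final argument would make the request strict: if no container can be started on the named node, none is allocated.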
In the common case of launching a container to process an HDFS block (to run a map task
in MapReduce, say), the application will request a container on one of the nodes hosting
the block's three replicas, or on a node in one of the racks hosting the replicas, or, failing
that, on any node in the cluster.
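One way an application can compute those preferred nodes and racks is from the block metadata that HDFS exposes. A sketch, assuming a hypothetical input path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocalitySketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = new Path("/data/input/part-00000"); // hypothetical path
    FileStatus status = fs.getFileStatus(input);

    // One BlockLocation per block: each lists the hosts holding a
    // replica, plus their topology paths (rack then host), which can
    // seed the nodes and racks arrays of a ContainerRequest.
    for (BlockLocation block :
        fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(java.util.Arrays.toString(block.getHosts())
          + " " + java.util.Arrays.toString(block.getTopologyPaths()));
    }
  }
}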
A YARN application can make resource requests at any time while it is running. For example, an application can make all of its requests up front, or it can take a more dynamic approach, requesting further resources as the needs of the application change.
Spark takes the first approach, starting a fixed number of executors on the cluster (see Spark on YARN). MapReduce, on the other hand, has two phases: the map task containers are requested up front, but the reduce task containers are not started until later. Also, if any tasks fail, additional containers will be requested so the failed tasks can be rerun.
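Under either strategy, requests are submitted and satisfied through the application master's periodic heartbeat to the resource manager. A simplified sketch of that allocation loop, again using the AMRMClient API (registration, container launch, and error handling elided):

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AllocationLoopSketch {
  // Polls the resource manager: grants arrive incrementally across
  // heartbeats, so new requests can be added at any point while the
  // application is running.
  static void runLoop(AMRMClient<ContainerRequest> amClient)
      throws Exception {
    while (true) {
      AllocateResponse response = amClient.allocate(0.0f); // heartbeat
      for (Container container : response.getAllocatedContainers()) {
        // Launch a task on the granted container via an NMClient
        // (not shown).
      }
      // A dynamic application (MapReduce requesting reduce containers
      // later, or replacement containers for failed tasks) would call
      // amClient.addContainerRequest(...) here as its needs change.
      Thread.sleep(1000);
    }
  }
}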