The latter is what the MapReduce YARN application does, which we'll look at in more detail in Anatomy of a MapReduce Job Run.
Notice from Figure 4-2 that YARN itself does not provide any way for the parts of the application (client, master, process) to communicate with one another. Most nontrivial YARN applications use some form of remote communication (such as Hadoop's RPC layer) to pass status updates and results back to the client, but these are specific to the application.
Resource Requests
YARN has a flexible model for making resource requests. A request for a set of containers
can express the amount of computer resources required for each container (memory and
CPU), as well as locality constraints for the containers in that request.
Locality is critical in ensuring that distributed data processing algorithms use the cluster bandwidth efficiently,[36] so YARN allows an application to specify locality constraints for the containers it is requesting. Locality constraints can be used to request a container on a specific node or rack, or anywhere on the cluster (off-rack).
Sometimes the locality constraint cannot be met, in which case either no allocation is made or, optionally, the constraint can be loosened. For example, if a specific node was requested but it is not possible to start a container on it (because other containers are running on it), then YARN will try to start a container on a node in the same rack, or, if that's not possible, on any node in the cluster.
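To make this concrete, here is a minimal sketch of such a request using YARN's AMRMClient API, as an application master might issue it. The memory, core, node, and rack values are hypothetical, and client initialization and registration with the resource manager are elided:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityRequestSketch {
  public static void main(String[] args) {
    // Client setup (init/start and AM registration elided for brevity).
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();

    // Each container in this request needs 1024 MB of memory and 1 vcore.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);

    // Preferred placement: a specific node, then its rack. The final
    // argument (relaxLocality = true) lets the scheduler fall back to
    // the rack, and then to any node, if the preferred node is busy.
    ContainerRequest request = new ContainerRequest(
        capability,
        new String[] {"node1.example.com"}, // hypothetical node
        new String[] {"/rack1"},            // hypothetical rack
        priority,
        true);                              // allow locality relaxation

    amClient.addContainerRequest(request);
  }
}

Passing false for the final argument would make the request strict: if no container can be started on the named node, none is allocated.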
In the common case of launching a container to process an HDFS block (to run a map task
in MapReduce, say), the application will request a container on one of the nodes hosting
the block's three replicas, or on a node in one of the racks hosting the replicas, or, failing
that, on any node in the cluster.
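One way an application can compute those preferred nodes and racks is from the block metadata that HDFS exposes. A sketch, assuming a hypothetical input path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocalitySketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = new Path("/data/input/part-00000"); // hypothetical path
    FileStatus status = fs.getFileStatus(input);

    // One BlockLocation per block: each lists the hosts holding a
    // replica, plus their topology paths (rack then host), which can
    // seed the nodes and racks arrays of a ContainerRequest.
    for (BlockLocation block :
        fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(java.util.Arrays.toString(block.getHosts())
          + " " + java.util.Arrays.toString(block.getTopologyPaths()));
    }
  }
}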
A YARN application can make resource requests at any time while it is running. For example, an application can make all of its requests up front, or it can take a more dynamic approach, requesting further resources as the needs of the application change.
Spark takes the first approach, starting a fixed number of executors on the cluster (see Spark on YARN). MapReduce, on the other hand, has two phases: the map task containers are requested up front, but the reduce task containers are not started until later. Also, if any tasks fail, additional containers will be requested so the failed tasks can be rerun.
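Under either strategy, requests are submitted and satisfied through the application master's periodic heartbeat to the resource manager. A simplified sketch of that allocation loop, again using the AMRMClient API (registration, container launch, and error handling elided):

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AllocationLoopSketch {
  // Polls the resource manager: grants arrive incrementally across
  // heartbeats, so new requests can be added at any point while the
  // application is running.
  static void runLoop(AMRMClient<ContainerRequest> amClient)
      throws Exception {
    while (true) {
      AllocateResponse response = amClient.allocate(0.0f); // heartbeat
      for (Container container : response.getAllocatedContainers()) {
        // Launch a task on the granted container via an NMClient
        // (not shown).
      }
      // A dynamic application (MapReduce requesting reduce containers
      // later, or replacement containers for failed tasks) would call
      // amClient.addContainerRequest(...) here as its needs change.
      Thread.sleep(1000);
    }
  }
}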