YARN
To run YARN, you need to designate one machine as a resource manager. The simplest
way to do this is to set the property yarn.resourcemanager.hostname to the
hostname or IP address of the machine running the resource manager. Many of the
resource manager's server addresses are derived from this property. For example,
yarn.resourcemanager.address takes the form of a host-port pair, and the host
defaults to yarn.resourcemanager.hostname. In a MapReduce client configuration,
this property is used to connect to the resource manager over RPC.
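For instance, a minimal yarn-site.xml might contain nothing but this property (the hostname resourcemanager.example.com is a placeholder for your own machine):

  <?xml version="1.0"?>
  <configuration>
    <!-- Hostname of the machine running the resource manager; the other
         resource manager server addresses are derived from this value. -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>resourcemanager.example.com</value>
    </property>
  </configuration>

With this single setting, the derived addresses fall back to their default ports on that host; yarn.resourcemanager.address, for example, defaults to port 8032 there.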
During a MapReduce job, intermediate data and working files are written to temporary
local files. Because this data includes the potentially very large output of map tasks, you
need to ensure that the yarn.nodemanager.local-dirs property, which controls
the location of local temporary storage for YARN containers, is configured to use disk
partitions that are large enough. The property takes a comma-separated list of directory
names, and you should use all available local disks to spread disk I/O (the directories are
used in round-robin fashion). Typically, you will use the same disks and partitions (but
different directories) for YARN local storage as you use for datanode block storage, as
governed by the dfs.datanode.data.dir property, which was discussed earlier.
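As an illustration, a node manager with three data disks might list one directory per disk (the paths below are hypothetical and should match your own mount points):

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <!-- One directory per physical disk; containers' temporary files
         are spread across them in round-robin fashion. -->
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir,/disk3/nm-local-dir</value>
  </property>

Unlike datanode block storage, this data is transient, so the directories only need enough free space to hold the intermediate output of currently running jobs.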
Unlike MapReduce 1, YARN doesn't have tasktrackers to serve map outputs to reduce
tasks, so for this function it relies on shuffle handlers, which are long-running auxiliary
services running in node managers. Because YARN is a general-purpose service, the
MapReduce shuffle handlers need to be enabled explicitly in yarn-site.xml by setting the
yarn.nodemanager.aux-services property to mapreduce_shuffle.
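A sketch of the corresponding yarn-site.xml entries follows. The second property maps the service name to its handler class; in many Hadoop versions it is already the default and can be omitted, but it makes the registration explicit:

  <property>
    <!-- Register the MapReduce shuffle handler as an auxiliary service
         in each node manager. -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>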
Table 10-3 summarizes the important configuration properties for YARN. The resource-
related settings are covered in more detail in the next sections.