Setting Up a Hadoop Cluster - Hadoop: The Definitive Guide


</property>

</configuration>

HDFS

To run HDFS, you need to designate one machine as a namenode. In this case, the property fs.defaultFS is an HDFS filesystem URI whose host is the namenode's hostname or IP address and whose port is the port that the namenode will listen on for RPCs. If no port is specified, the default of 8020 is used.
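As a minimal sketch of the setting described above, a core-site.xml entry might look like the following. The hostname namenode is a placeholder for your own namenode's hostname or IP address, and the port is written out only for illustration, since 8020 is the default when no port is given.

```xml
<?xml version="1.0"?>
<!-- core-site.xml (illustrative sketch; "namenode" is a placeholder hostname) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- Host is the namenode; 8020 is the default RPC port and could be omitted -->
    <value>hdfs://namenode:8020/</value>
  </property>
</configuration>
```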

The fs.defaultFS property also specifies the default filesystem. The default filesystem is used to resolve relative paths, which are handy because they save typing (and avoid hardcoding knowledge of a particular namenode's address). For example, with the default filesystem defined in Example 10-1, the relative URI /a/b is resolved to hdfs://namenode/a/b.

NOTE

If you are running HDFS, the fact that fs.defaultFS is used to specify both the HDFS namenode and the default filesystem means HDFS has to be the default filesystem in the server configuration. Bear in mind, however, that it is possible to specify a different filesystem as the default in the client configuration, for convenience.

For example, if you use both HDFS and S3 filesystems, then you have a choice of specifying either as the default in the client configuration, which allows you to refer to the default with a relative URI and the other with an absolute URI.
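To illustrate the client-side choice described in the note, here is a hedged sketch of a client core-site.xml that makes S3 the default filesystem. The s3a scheme and the bucket name example-bucket are assumptions made for illustration and do not come from the text; substitute whatever S3 filesystem scheme and bucket your installation uses.

```xml
<?xml version="1.0"?>
<!-- Client-side core-site.xml (illustrative sketch only) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- Assumed scheme and hypothetical bucket name -->
    <value>s3a://example-bucket/</value>
  </property>
</configuration>
```

With a client configuration like this, the relative URI /a/b would refer to the S3 filesystem, while HDFS would be addressed with an absolute URI such as hdfs://namenode/a/b.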

There are a few other configuration properties you should set for HDFS: those that set the storage directories for the namenode and for datanodes. The property dfs.namenode.name.dir specifies a list of directories where the namenode stores persistent filesystem metadata (the edit log and the filesystem image). A copy of each metadata file is stored in each directory for redundancy. It's common to configure dfs.namenode.name.dir so that the namenode metadata is written to one or two local disks, as well as a remote disk, such as an NFS-mounted directory. Such a setup guards against failure of a local disk and failure of the entire namenode, since in both cases the files can be recovered and used to start a new namenode. (The secondary namenode takes only periodic checkpoints of the namenode, so it does not provide an up-to-date backup of the namenode.)
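The redundancy setup described above could be sketched in hdfs-site.xml as follows. The directory paths are hypothetical: two are assumed to sit on separate local disks and the third on an NFS mount. The value is a comma-separated list, and the namenode writes a copy of its metadata to every directory listed.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml (illustrative sketch; all paths are hypothetical) -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- Two local disks plus one NFS-mounted directory -->
    <value>/disk1/hdfs/name,/disk2/hdfs/name,/remote/hdfs/name</value>
  </property>
</configuration>
```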

You should also set the dfs.datanode.data.dir property, which specifies a list of directories for a datanode to store its blocks in. Unlike the namenode, which uses multiple directories for redundancy, a datanode round-robins writes between its storage directories,
