to HDFS. These include adding/removing nodes (capacity) and recovery from
NameNode failure. We end the chapter with a section on setting up a scheduler to
manage multiple running jobs.
8.1 Setting up parameter values for practical use
Hadoop has many configuration parameters. Their default values tend to target standalone mode, and they veer toward being idiotproof: they're likely to work on most systems without causing any errors, but they're often far from optimal in a production cluster. Table 8.1 shows some of the system properties that you'll want to change for a production cluster.
Table 8.1 Hadoop properties that you can tune for a production cluster

Property                                       Description                                      Suggested value
dfs.name.dir                                   Directory in NameNode's local filesystem         /home/hadoop/dfs/name
                                               to store HDFS's metadata
dfs.data.dir                                   Directory in a DataNode's local filesystem       /home/hadoop/dfs/data
                                               to store HDFS's file blocks
mapred.system.dir                              Directory in HDFS for storing shared             /hadoop/mapred/system
                                               MapReduce system files
mapred.local.dir                               Directory in a TaskNode's local filesystem
                                               to store temporary data
mapred.tasktracker.{map|reduce}.tasks.maximum  Maximum number of map and reduce tasks
                                               that can run simultaneously in a TaskTracker
hadoop.tmp.dir                                 Temporary Hadoop directories                     /home/hadoop/tmp
dfs.datanode.du.reserved                       Minimum amount of free space a DataNode          1073741824
                                               should have
mapred.child.java.opts                         Heap size allocated to each child task           -Xmx512m
mapred.reduce.tasks                            Number of reduce tasks for a job
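These properties are set in Hadoop's XML configuration files (hadoop-site.xml in older releases; split across core-site.xml, hdfs-site.xml, and mapred-site.xml in later ones). A minimal sketch using the suggested values from table 8.1 (the paths are only illustrative):

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml: cluster-wide overrides of Hadoop's defaults.  -->
<!-- Values follow the suggestions in table 8.1; adjust paths and   -->
<!-- sizes to your own hardware.                                    -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>  <!-- reserve 1 GB (2^30 bytes) on each DataNode -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>    <!-- 512 MB heap for each child task JVM -->
  </property>
</configuration>
```

Each slave node reads this file at daemon startup, so changes to the per-node properties take effect only after the affected daemons are restarted.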
The default values for dfs.name.dir and dfs.data.dir point to directories under /tmp, which is intended only for temporary storage on almost all Unix systems. You will definitely want to change those properties for a production cluster.1 In addition, these properties can take comma-separated lists of directories. In the case of dfs.name.dir, multiple directories are good for backup purposes. If a DataNode has multiple drives, you should have a data directory on each one and list them all in dfs.data.dir.
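For instance, on a DataNode with three data drives, the comma-separated settings might look like the following sketch (the mount points are illustrative):

```xml
<!-- NameNode metadata is written to every directory in the list,  -->
<!-- so placing them on separate drives (or a remote mount) guards -->
<!-- against losing the metadata to a single disk failure.         -->
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/backup/dfs/name</value>
</property>
<!-- DataNode blocks are spread across these directories rather    -->
<!-- than duplicated; list one directory per physical drive.       -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
</property>
```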
1 The rationale for using /tmp illustrates how default values are idiotproof. Every Unix system has the /tmp
directory so you won't get a “directory not found” error.