to HDFS. These include adding/removing nodes (capacity) and recovery from
NameNode failure. We end the chapter with a section on setting up a scheduler to
manage multiple running jobs.
8.1 Setting up parameter values for practical use
Hadoop has many configuration parameters. Their default values tend to target standalone mode, and they veer toward being idiotproof: they're likely to work on most systems without causing any errors, but they're often far from optimal in a production cluster. Table 8.1 shows some of the system properties that you'll want to change for a production cluster.
Table 8.1 Hadoop properties that you can tune for a production cluster

Property                                       Description                                      Suggested value
dfs.name.dir                                   Directory in NameNode's local filesystem         /home/hadoop/dfs/name
                                               to store HDFS's metadata
dfs.data.dir                                   Directory in a DataNode's local filesystem       /home/hadoop/dfs/data
                                               to store HDFS's file blocks
mapred.system.dir                              Directory in HDFS for storing shared             /hadoop/mapred/system
                                               MapReduce system files
mapred.local.dir                               Directory in a TaskNode's local filesystem
                                               to store temporary data
mapred.tasktracker.{map|reduce}.tasks.maximum  Maximum number of map and reduce tasks
                                               that can run simultaneously in a TaskTracker
hadoop.tmp.dir                                 Temporary Hadoop directories                     /home/hadoop/tmp
dfs.datanode.du.reserved                       Minimum amount of free space a DataNode          1073741824
                                               should have
mapred.child.java.opts                         Heap size allocated to each child task           -Xmx512m
mapred.reduce.tasks                            Number of reduce tasks for a job
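These properties are set in Hadoop's XML configuration files (hadoop-site.xml in older releases; split across core-site.xml, hdfs-site.xml, and mapred-site.xml in later ones). A minimal sketch using the suggested values from table 8.1 (the paths are only illustrative):

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml: cluster-wide overrides of Hadoop's defaults.  -->
<!-- Values follow the suggestions in table 8.1; adjust paths and   -->
<!-- sizes to your own hardware.                                    -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>  <!-- reserve 1 GB (2^30 bytes) on each DataNode -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>    <!-- 512 MB heap for each child task JVM -->
  </property>
</configuration>
```

Each slave node reads this file at daemon startup, so changes to the per-node properties take effect only after the affected daemons are restarted.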
The default values for dfs.name.dir and dfs.data.dir point to directories under /tmp, which is intended only for temporary storage on almost all Unix systems. You will definitely want to change those properties for a production cluster.1 In addition, these properties can take comma-separated lists of directories. In the case of dfs.name.dir, multiple directories are good for backup purposes. If a DataNode has multiple drives, you should have a data directory on each one and list them all in dfs.data.dir.
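For instance, on a DataNode with three data drives, the comma-separated settings might look like the following sketch (the mount points are illustrative):

```xml
<!-- NameNode metadata is written to every directory in the list,  -->
<!-- so placing them on separate drives (or a remote mount) guards -->
<!-- against losing the metadata to a single disk failure.         -->
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/backup/dfs/name</value>
</property>
<!-- DataNode blocks are spread across these directories rather    -->
<!-- than duplicated; list one directory per physical drive.       -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
</property>
```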
1 The rationale for using /tmp illustrates how default values are idiotproof. Every Unix system has the /tmp
directory so you won't get a “directory not found” error.