Managing Hadoop - Hadoop in Action


When you add a new DataNode, it will initially be empty, whereas existing DataNodes will already be filled to some capacity. The filesystem is then considered unbalanced. New files will likely go to the new node, but their replicated blocks will still go to the old nodes. You should proactively start the HDFS balancer to balance the cluster for optimal performance. Run the balancer script:

bin/start-balancer.sh

The script will run in the background until the cluster is balanced. An administrator can also terminate it earlier by running

bin/stop-balancer.sh

A cluster is considered balanced when the utilization rates of all the DataNodes are within the range of the average utilization rate plus or minus a threshold. This threshold is 10 percent by default. You can specify a different threshold when you start the balancer script. For example, to set the threshold to 5 percent for a more evenly distributed cluster, start the balancer with

bin/start-balancer.sh -threshold 5
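The balance criterion described above can be sketched as a small shell function. This is only an illustration of the rule, not the balancer's own code: the function name `is_balanced` and the sample utilization figures are hypothetical, and the sketch assumes all DataNodes have equal capacity so that a simple mean of the per-node rates stands in for the cluster's average utilization.

```shell
#!/bin/sh
# Hypothetical helper: report whether every DataNode's utilization (in percent)
# lies within the average utilization plus or minus a threshold.
# Assumes equal-capacity nodes, so the plain mean is the cluster average.
# usage: is_balanced THRESHOLD UTIL1 UTIL2 ...
is_balanced() {
    threshold=$1
    shift
    # awk does the fractional arithmetic that plain sh lacks.
    echo "$@" | awk -v t="$threshold" '{
        sum = 0
        for (i = 1; i <= NF; i++) sum += $i
        avg = sum / NF
        for (i = 1; i <= NF; i++) {
            d = $i - avg
            if (d < 0) d = -d
            if (d > t) { print "unbalanced"; exit }
        }
        print "balanced"
    }'
}

# Three partly filled nodes plus one nearly empty new node (made-up numbers).
is_balanced 10 70 65 72 5
# The same check on nodes whose utilization is already close together.
is_balanced 10 55 50 52 48
```

With the made-up figures above, the first call reports the cluster as unbalanced because the older nodes sit well above the average once the nearly empty node is counted. Lowering the threshold makes the check stricter in the same way that `-threshold 5` does for the real balancer.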

As balancing can be network intensive, we recommend doing it overnight or over a weekend when your cluster may be less busy. Alternatively, you can set the dfs.balance.bandwidthPerSec configuration parameter to limit the bandwidth devoted to balancing.
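As a rough sketch of setting this limit: the property value is given in bytes per second, so a cap expressed in megabytes has to be converted first. The helper name `mb_to_bytes` and the 10 MB/s figure below are illustrative assumptions, not recommendations. The script only prints a property element that you would paste into your HDFS site configuration file; which file that is depends on your Hadoop version.

```shell
#!/bin/sh
# Hypothetical helper: convert megabytes per second to bytes per second.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Assumed cap of 10 MB/s per DataNode; pick a value that suits your network.
limit=$(mb_to_bytes 10)

# Print the property element to add to the site configuration file.
cat <<EOF
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>$limit</value>
</property>
EOF
```

A lower value leaves more bandwidth for regular jobs, at the cost of the balancer taking longer to finish.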

8.8 Managing NameNode and Secondary NameNode

The NameNode is one of the most important components in the HDFS architecture. It holds the filesystem's metadata and caches the cluster's blockmap in RAM for reasonable performance. When you have anything other than a tiny cluster, you should dedicate a machine to run as the NameNode and not put any DataNode, JobTracker, or TaskTracker service on it. This NameNode machine should be the most powerful machine in the cluster. Give it as much RAM as possible. Although DataNodes may have higher performance with JBOD disk drives, you should definitely use RAID drives in your NameNode for higher reliability against any single drive failure.

One approach to reducing the burden on the NameNode is to reduce the amount of filesystem metadata by increasing the block size. Doubling the block size will almost halve the amount of metadata. Unfortunately, this also decreases parallelism for files that are not large. The ideal block size will depend on your specific deployment. The block size is set in the configuration parameter dfs.block.size. For example, to double the block size from the default 64 MB to 128 MB, set dfs.block.size to 134217728.
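The arithmetic behind this trade-off can be checked with a few lines of shell. The sketch below derives the byte value for a 128 MB block and counts how many blocks a file needs under each block size. The helper name `blocks_for` and the two sample file sizes are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: number of blocks needed for a file, i.e. the file size
# divided by the block size, rounded up (integer arithmetic).
# usage: blocks_for FILE_BYTES BLOCK_BYTES
blocks_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

MB=$((1024 * 1024))

# Value to use for dfs.block.size when asking for 128 MB blocks.
echo "128 MB in bytes: $((128 * MB))"

# Block counts for a 1 GB file and a 100 MB file under each block size.
for size_mb in 1024 100; do
    for block_mb in 64 128; do
        n=$(blocks_for $((size_mb * MB)) $((block_mb * MB)))
        echo "${size_mb} MB file, ${block_mb} MB blocks: $n block(s)"
    done
done
```

The 1 GB file drops from 16 blocks to 8, which halves the block entries the NameNode must track for it. The 100 MB file drops from 2 blocks to 1, so it can no longer be split across two nodes for processing. A file smaller than 64 MB occupies one block either way, which is why the overall saving is "almost" rather than exactly half.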

By default, the Secondary NameNode [3] and the NameNode run on the same machine. For moderately sized clusters (10 or more nodes), you should separate the

[3] As of this writing, the Secondary NameNode is slated to be deprecated by version 0.21 of Hadoop, which should be released as this book goes to press. The Secondary NameNode will be replaced by a more robust design for warm standby. You should check the online documentation of the version of Hadoop you're using to confirm whether it's still using Secondary NameNode or not. The particular patch for this change is at https://issues.apache.org/jira/browse/HADOOP-4539.
