When you add a new DataNode, it will initially be empty, whereas existing DataNodes
will already be filled to some capacity. The filesystem is then considered unbalanced.
New files will likely go to the new node, but their replicated blocks will still go to
the old nodes. You should proactively start the HDFS balancer to balance the cluster
for optimal performance. Run the balancer script:
bin/start-balancer.sh
The script will run in the background until the cluster is balanced. An administrator
can also terminate it earlier by running
bin/stop-balancer.sh
A cluster is considered balanced when the utilization rates of all the DataNodes are
within the range of the average utilization rate plus or minus a threshold. This
threshold is 10 percent by default. You can specify a different threshold when you
start the balancer script. For example, to set the threshold to 5 percent for a more
evenly distributed cluster, start the balancer with
bin/start-balancer.sh -threshold 5
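The balance criterion described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the balancer's actual code: the function name `is_balanced` and the sample utilization rates are hypothetical, and the average is assumed to be a simple mean of the per-node rates.

```python
def is_balanced(utilizations, threshold_pct=10.0):
    """Return True if every DataNode's utilization (in percent) lies within
    threshold_pct percentage points of the average utilization."""
    if not utilizations:
        raise ValueError("need at least one DataNode utilization")
    # Assumption: the average is the plain mean of the per-node rates.
    average = sum(utilizations) / len(utilizations)
    return all(abs(u - average) <= threshold_pct for u in utilizations)


# Hypothetical cluster: three older DataNodes plus one newly added, nearly empty node.
after_adding_node = [62.0, 58.0, 60.0, 4.0]
print(is_balanced(after_adding_node))                  # default 10 percent threshold

# Hypothetical cluster after balancing, checked against a stricter 5 percent threshold.
after_balancing = [52.0, 48.0, 55.0, 45.0]
print(is_balanced(after_balancing, threshold_pct=5.0))
```

With these sample numbers the first cluster fails the check because the new node sits far below the average, while the second passes even at the tighter threshold.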
As balancing can be network intensive, we recommend doing it overnight or over a
weekend when your cluster may be less busy. Alternatively, you can set the
dfs.balance.bandwidthPerSec configuration parameter to limit the bandwidth devoted
to balancing.
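To get a feel for what a given bandwidth limit means in practice, the following Python sketch converts a limit in megabytes per second into a raw value and estimates how long one DataNode would need to move a given amount of data at that cap. It assumes the parameter is expressed in bytes per second and applies per DataNode; check the documentation for your Hadoop version to confirm. The helper names and the 100 GB / 10 MB/s figures are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024


def bandwidth_setting(mb_per_sec):
    """Value to use for dfs.balance.bandwidthPerSec.
    Assumption: the parameter is given in bytes per second."""
    return int(mb_per_sec * BYTES_PER_MB)


def hours_to_move(gigabytes, mb_per_sec):
    """Rough lower bound, in hours, for one DataNode to transfer
    `gigabytes` of data when capped at `mb_per_sec`."""
    total_bytes = gigabytes * 1024 * BYTES_PER_MB
    return total_bytes / bandwidth_setting(mb_per_sec) / 3600


print(bandwidth_setting(1))                # bytes per second for a 1 MB/s cap
print(round(hours_to_move(100, 10), 2))    # hours to move 100 GB at 10 MB/s
```

The estimate ignores protocol overhead and competing traffic, so real balancing runs will usually take longer.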
8.8 Managing NameNode and Secondary NameNode
The NameNode is one of the most important components in the HDFS architecture. It
holds the filesystem's metadata and caches the cluster's blockmap in RAM for
reasonable performance. When you have anything other than a tiny cluster, you should
dedicate a machine to run as the NameNode and not put any DataNode, JobTracker,
or TaskTracker service on it. This NameNode machine should be the most powerful
machine in the cluster. Give it as much RAM as possible. Although DataNodes may
have higher performance with JBOD disk drives, you should definitely use RAID
drives in your NameNode for higher reliability against any single drive failure.
One approach to reducing the burden on the NameNode is to reduce the amount
of filesystem metadata by increasing the block size. Doubling the block size will
almost halve the amount of metadata. Unfortunately, this also decreases parallelism
for files that are not large. The ideal block size will depend on your specific
deployment. The block size is set in the configuration parameter dfs.block.size.
For example, to double the block size from the default 64 MB to 128 MB, set
dfs.block.size to 134217728.
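The arithmetic behind these numbers can be checked with a short Python sketch. It computes the byte value for a 128 MB block size and counts how many blocks a few files would occupy at 64 MB versus 128 MB. The file sizes are hypothetical and `block_count` is an illustrative helper, not part of Hadoop.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed for one file; the last block may be partial."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# Value to put in dfs.block.size for 128 MB blocks.
print(128 * MB)

# Hypothetical files: 1 GB, 100 MB, and 10 MB.
file_sizes = [1024 * MB, 100 * MB, 10 * MB]

for block_mb in (64, 128):
    per_file = [block_count(size, block_mb * MB) for size in file_sizes]
    print(block_mb, per_file, sum(per_file))
```

For these sample files the total block count drops from 19 to 10, which is close to but not exactly half because small files still need at least one block each. The 100 MB file also goes from two blocks to one, which illustrates the loss of parallelism for files that are not large.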
By default, the Secondary NameNode [3] and the NameNode run on the same
machine. For moderate size clusters (10 or more nodes), you should separate the

[3] As of this writing, the Secondary NameNode is slated to be deprecated by version 0.21 of Hadoop, which
should be released as this topic goes to press. The Secondary NameNode will be replaced by a more robust
design for warm standby. You should check the online documentation of the version of Hadoop you're
using to confirm whether it's still using Secondary NameNode or not. The particular patch for this change
is at https://issues.apache.org/jira/browse/HADOOP-4539.
 