variance to use for the rebalancing process. The default threshold level is
10%, as shown here:
hadoop balancer -threshold 10
The balancer determines an average space utilization across the cluster.
Nodes are considered over- or underutilized if their space usage varies by
more than the threshold from the average space utilization. The balancer
runs until one of the following occurs:
• All the nodes in the cluster have been balanced.
• It has gone five consecutive iterations without moving any blocks.
• The user who started the balancer aborts it by pressing Ctrl+C.
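The utilization check and the stopping conditions above can be sketched in a few lines of Python. This is only an illustration of the logic described here, not the balancer's actual implementation: the function names, the node names, and the sample figures are hypothetical, the average is assumed to be total used space divided by total capacity, and the stall limit is passed in as a parameter rather than fixed.

```python
from typing import Dict


def classify_nodes(used: Dict[str, float],
                   capacity: Dict[str, float],
                   threshold_pct: float = 10.0) -> Dict[str, str]:
    """Label each node 'over', 'under', or 'balanced' (illustrative only)."""
    # Assumption: cluster average = total used space / total capacity.
    avg_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    labels = {}
    for node in used:
        node_pct = 100.0 * used[node] / capacity[node]
        if node_pct > avg_pct + threshold_pct:
            labels[node] = "over"
        elif node_pct < avg_pct - threshold_pct:
            labels[node] = "under"
        else:
            labels[node] = "balanced"
    return labels


def should_stop(labels: Dict[str, str],
                stalled_iterations: int,
                max_stalled: int,
                aborted: bool = False) -> bool:
    """Return True when any of the three stopping conditions holds."""
    all_balanced = all(state == "balanced" for state in labels.values())
    no_progress = stalled_iterations >= max_stalled
    return all_balanced or no_progress or aborted


# Hypothetical three-node cluster; sizes are in arbitrary units.
used = {"dn1": 80.0, "dn2": 50.0, "dn3": 20.0}
capacity = {"dn1": 100.0, "dn2": 100.0, "dn3": 100.0}
labels = classify_nodes(used, capacity, threshold_pct=10.0)
print(labels)
```

With a 10% threshold and an average utilization of 50%, the node at 80% is treated as overutilized and the node at 20% as underutilized, while the node at 50% is left alone.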
Balancing the data across nodes is an important step in maintaining the
performance of the cluster, and it should be carried out whenever there are
significant changes to the nodes in a cluster.
Summary
This chapter covered the background of HDFS, along with some of its
underlying details, including how NameNodes and DataNodes interact to
store information. It introduced the basic commands for working with and
administering HDFS, such as ls for listing files, get and put for moving
files in and out of HDFS, and rm for removing unnecessary files. It also
covered some advanced administrative topics, like balancing and data
movement, that are important for maintaining your HDFS cluster. The next
chapter builds on these topics with a discussion of how Hive runs on top
of HDFS while presenting the appearance of a traditional RDBMS to
applications.