kept in a manner that distributes load evenly across the cluster. The NameNode executes
file system namespace operations such as opening, closing, and renaming files and directories.
The NameNode also determines the mapping of blocks to DataNodes. DataNodes
perform block creation, deletion, and replication as directed by the NameNode. The
DataNodes serve read and write requests from clients.
HDFS is written in Java, so any machine that supports Java can run the NameNode and
DataNode software. This portability means that HDFS can be deployed on a wide range of
machines, typically commodity hardware running the GNU/Linux operating system (OS).
HDFS exposes a traditional hierarchical file system namespace. Directories and files can
be created and removed, moved from one path to another, and renamed. User quotas, access
permissions, and hard and soft links are not yet supported by HDFS. Users who need these
features can implement them or rely on the underlying OS. The NameNode maintains the file
system namespace and records any change to the namespace or its properties. Users can
specify the replication factor, which determines the number of replicas kept for a file,
in a configuration file on the NameNode.
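
The namespace operations described above can be illustrated with a toy in-memory model.
This is only a sketch of the idea: the class ToyNamespace and its methods are hypothetical
names chosen for illustration and are not part of the Hadoop API.

```python
class ToyNamespace:
    """Hypothetical in-memory stand-in for the namespace a NameNode maintains."""

    def __init__(self, default_replication=3):
        # Maps absolute path -> replication factor (files) or None (directories).
        self.entries = {"/": None}
        self.default_replication = default_replication

    @staticmethod
    def _parent(path):
        parent = path.rsplit("/", 1)[0]
        return parent or "/"

    def _require_directory(self, path):
        if path not in self.entries or self.entries[path] is not None:
            raise FileNotFoundError(f"no such directory: {path}")

    def mkdir(self, path):
        self._require_directory(self._parent(path))
        self.entries[path] = None

    def create(self, path, replication=None):
        # A file records its own replication factor, or the default.
        self._require_directory(self._parent(path))
        self.entries[path] = replication or self.default_replication

    def remove(self, path):
        # Remove the entry and everything below it.
        doomed = [p for p in self.entries
                  if p == path or p.startswith(path + "/")]
        for p in doomed:
            del self.entries[p]

    def rename(self, src, dst):
        # Moving and renaming are the same operation: re-key the subtree.
        if src not in self.entries:
            raise FileNotFoundError(src)
        self._require_directory(self._parent(dst))
        moved = {p: v for p, v in self.entries.items()
                 if p == src or p.startswith(src + "/")}
        for p in moved:
            del self.entries[p]
        for p, v in moved.items():
            self.entries[dst + p[len(src):]] = v


ns = ToyNamespace()
ns.mkdir("/logs")
ns.create("/logs/a.txt")                  # uses the default replication factor
ns.create("/logs/b.txt", replication=2)   # per-file override
ns.rename("/logs", "/archive")            # moves the whole subtree
print(sorted(ns.entries))
```

Keeping the whole namespace in one mapping mirrors the point that a single NameNode owns
the namespace, so a rename is a metadata change only and no block data has to move.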
Architecture
Figure 9.5 shows an architectural overview of HDFS, including the NameNode
and DataNodes. Clients execute read and write operations, and it is up to the
NameNode to maintain the replicas, deciding when and where to place new ones. The
NameNode also receives a periodic heartbeat and block report from each DataNode in the cluster.
A block report contains a list of all blocks stored on a DataNode. The purpose of the
heartbeat is to check whether the DataNode is still alive and functioning properly. A
DataNode whose heartbeats stop arriving is marked as dead, and the NameNode stops sending
new I/O requests to it. The purpose of the block report is to make the NameNode aware
of which replicas are located on which DataNodes. It also helps the NameNode make
future decisions about where to put new replicas.
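
The bookkeeping that heartbeats and block reports make possible can be sketched as
follows. This is a simplified model, not the NameNode implementation: the class
ToyNameNode, its method names, and the 30-second timeout are assumptions made for
illustration, and timestamps are passed in explicitly so the example is deterministic.

```python
class ToyNameNode:
    """Hypothetical model of heartbeat and block-report tracking."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # assumed value, in seconds
        self.last_heartbeat = {}    # DataNode id -> time of last heartbeat
        self.block_locations = {}   # block id -> set of DataNode ids

    def receive_heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def receive_block_report(self, datanode, block_ids):
        # A report is a full list, so it replaces what we believed before.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def live_datanodes(self, now):
        # A DataNode counts as alive only if it reported recently enough.
        return {dn for dn, seen in self.last_heartbeat.items()
                if now - seen <= self.heartbeat_timeout}

    def live_replicas(self, block_id, now):
        # Replicas on dead DataNodes do not count toward availability.
        holders = self.block_locations.get(block_id, set())
        return holders & self.live_datanodes(now)


nn = ToyNameNode(heartbeat_timeout=30.0)
nn.receive_heartbeat("dn1", now=0.0)
nn.receive_heartbeat("dn2", now=0.0)
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])

nn.receive_heartbeat("dn1", now=40.0)   # dn2 has gone silent
print(nn.live_replicas("blk_1", now=45.0))
```

In this sketch, a block whose live replica count falls below its replication factor is
exactly the situation in which the NameNode would schedule a new replica.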
Data Replication
HDFS reliably stores each file as a sequence of blocks, and the replicas of each block are
spread across DataNodes in the cluster according to the replication factor. All blocks in
a file are the same size except the last block. Replication provides fault tolerance and
recoverability from disaster. The block size and replication factor (dfs.replication) are
configurable in a configuration file on the NameNode (hdfs-site.xml). The replication
factor can be specified on a per-file basis and can be changed at any time. However, the
number of replicas actually stored cannot exceed the number of DataNodes, because a
DataNode holds at most one replica of a given block. Files in HDFS are write-once and have
strictly one writer process at any given time. This, as mentioned previously, avoids the
tedious tasks of serialization/deserialization, file locking, and repeated verification
of a continuously growing file.
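
To make the block and replication arithmetic concrete, the following sketch reads the two
settings from a sample hdfs-site.xml fragment and splits a file into blocks. The property
names dfs.replication and dfs.blocksize are standard Hadoop properties; the sample values,
the 300 MiB file size, and the helper function names are assumptions chosen for
illustration.

```python
import xml.etree.ElementTree as ET

# Sample configuration fragment; the values are illustrative assumptions.
SAMPLE_HDFS_SITE = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.blocksize</name><value>134217728</value></property>
</configuration>"""


def read_settings(xml_text):
    """Return the name -> value pairs of every <property> element."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}


def split_into_blocks(file_size, block_size):
    """Sizes of the blocks of a file: all equal except possibly the last."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


settings = read_settings(SAMPLE_HDFS_SITE)
block_size = int(settings["dfs.blocksize"])     # 128 MiB
replication = int(settings["dfs.replication"])

MIB = 1024 * 1024
blocks = split_into_blocks(300 * MIB, block_size)
raw_bytes = sum(blocks) * replication           # space used across the cluster

print([b // MIB for b in blocks])   # block sizes in MiB
print(raw_bytes // MIB)             # total MiB including replicas
```

With these sample values a 300 MiB file occupies two full 128 MiB blocks plus one 44 MiB
block, and three replicas of each block consume 900 MiB of raw storage.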
Replica placement strongly affects HDFS reliability and performance, and optimizing it
is a feature that distinguishes HDFS from most other distributed file systems.
The purpose of a rack-aware replica placement policy is to improve HDFS data reliability
and availability and to make efficient use of network bandwidth.
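
One way to picture a rack-aware policy is the simplified sketch below. It is not
necessarily the policy HDFS uses: the rule shown here (first replica on the writer's
node, second on a different rack, the rest preferring the second replica's rack) is an
assumption for illustration, as are the function name place_replicas and the topology
layout. Selection is deterministic rather than randomized to keep the example easy to
follow.

```python
def place_replicas(topology, writer_node, replication=3):
    """Choose DataNodes for one block.

    topology maps a rack name to the list of DataNodes on that rack.
    Returns at most `replication` distinct nodes.
    """
    rack_of = {node: rack
               for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # First replica: the writer's own node, so the first copy needs no network hop.
    chosen = [writer_node]

    # Second replica: a node on another rack, so a rack failure loses one copy at most.
    remote = [n for rack, nodes in topology.items()
              if rack != local_rack for n in nodes]
    if remote and len(chosen) < replication:
        chosen.append(remote[0])

    # Remaining replicas: prefer the second replica's rack to limit
    # cross-rack traffic, then fall back to any unused node.
    preferred = list(topology[rack_of[chosen[1]]]) if len(chosen) > 1 else []
    all_nodes = [n for nodes in topology.values() for n in nodes]
    for node in preferred + all_nodes:
        if len(chosen) >= replication:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen


topology = {"rack1": ["a", "b"], "rack2": ["c", "d"]}
print(place_replicas(topology, writer_node="a"))
```

Spreading replicas over two racks trades a little write bandwidth between racks for
protection against the loss of an entire rack, which is the balance a rack-aware policy
aims for.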