each file. To maintain file and block integrity, once a block is assigned to a DataNode, two files are
created in the local host's native file system to represent each replica. The first file contains the data
itself, and the second file holds the block's metadata, including checksums for the block data and the
generation stamp.
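The two-file layout described above can be sketched as follows. This is a minimal illustration, not the on-disk format HDFS actually uses: the file names, the 512-byte checksum chunk size, the CRC32 checksum, and the helper names `write_replica` and `verify_replica` are all assumptions made for the example.

```python
import os
import struct
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum (assumed value)


def write_replica(directory, block_id, generation_stamp, data):
    """Store one replica as a data file plus a separate metadata file."""
    # File names are illustrative only.
    data_path = os.path.join(directory, f"blk_{block_id}")
    meta_path = os.path.join(directory, f"blk_{block_id}_{generation_stamp}.meta")
    with open(data_path, "wb") as f:
        f.write(data)
    with open(meta_path, "wb") as f:
        # 8-byte generation stamp, then one 4-byte CRC32 per chunk of data.
        f.write(struct.pack(">Q", generation_stamp))
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            f.write(struct.pack(">I", zlib.crc32(chunk)))
    return data_path, meta_path


def verify_replica(data_path, meta_path):
    """Return True if every chunk of the data file matches its stored checksum."""
    with open(data_path, "rb") as f:
        data = f.read()
    with open(meta_path, "rb") as f:
        meta = f.read()
    # Skip the 8-byte generation stamp, then read the checksums.
    stored = [struct.unpack(">I", meta[i:i + 4])[0] for i in range(8, len(meta), 4)]
    actual = [zlib.crc32(data[s:s + CHUNK_SIZE]) for s in range(0, len(data), CHUNK_SIZE)]
    return stored == actual
```

Keeping the checksums in a separate file means a corrupted data file can be detected without trusting anything stored inside the data file itself.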
Replication and recovery
In the original design of HDFS, each cluster had a single NameNode, which made the NameNode a
single point of failure. Recent releases of HDFS address this: NameNode replication is now a
standard feature, like DataNode replication.
Communication and management
The most critical aspect of the HDFS architecture is the communication and management
between a NameNode and its DataNodes, which is implemented as a protocol of handshakes and
system IDs. When the file system is first created and formatted, it is assigned a namespace ID on
the NameNode. This ID is persistently stored on all the nodes across the cluster. Similarly, each
DataNode is assigned a unique storage ID when it is first created and registered with a NameNode.
This storage ID never changes and persists even if the DataNode is started on a different IP address
or port.
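One way to picture a storage ID that survives an address change is a registry keyed by the ID rather than by host and port. The sketch below is hypothetical: the `DataNodeRegistry` class, its `register` method, and the `DS-` prefixed UUID format are assumptions made for illustration, not HDFS APIs.

```python
import uuid


class DataNodeRegistry:
    """Hypothetical NameNode-side table of DataNodes, keyed by storage ID."""

    def __init__(self):
        self.addresses = {}  # storage_id -> (host, port) last seen

    def register(self, host, port, storage_id=None):
        # A DataNode registering for the first time has no storage ID yet,
        # so one is generated. On later registrations it presents the same ID.
        if storage_id is None:
            storage_id = f"DS-{uuid.uuid4()}"
        # Only the address is updated; the identity of the node is unchanged.
        self.addresses[storage_id] = (host, port)
        return storage_id


registry = DataNodeRegistry()
sid = registry.register("10.0.0.5", 50010)
# The same node restarts on a different address and presents its stored ID.
same_sid = registry.register("10.0.0.9", 50011, storage_id=sid)
```

Because the registry is keyed by storage ID, the restart updates the existing entry instead of creating a second DataNode.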
During the startup process, the NameNode completes its namespace refresh and is then ready to
establish communication with the DataNodes. To ensure that each DataNode that connects to the
NameNode is the correct DataNode, there is a series of verification steps:
The DataNode identifies itself to the NameNode with a handshake, during which its namespace ID
and software version are verified.
If either does not match with the NameNode, the DataNode automatically shuts down.
The signature verification process prevents incorrect nodes from joining the cluster and
automatically preserves the integrity of the file system.
The signature verification process is also a consistency check on the software versions of
the NameNode and DataNode, since an incompatible version can cause data corruption
or loss.
After the handshake and validation on the NameNode, a DataNode sends a block report. A block
report contains the block ID, the length for each block replica, and the generation stamp.
The first block report is sent immediately upon the DataNode registration.
Subsequently, the block report is resent to the NameNode every hour, which gives the NameNode
an up-to-date view of where block replicas are located on the cluster.
When a new DataNode is added and initialized, since it does not have a namespace ID, it is
permitted to join the cluster and receive the cluster's namespace ID.
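The verification steps and the block report can be sketched together as a small model. It is a simplified illustration under stated assumptions, not the HDFS wire protocol: the class and method names (`NameNode`, `handshake`, `receive_block_report`, `HandshakeError`) are hypothetical, a DataNode shutdown is modeled as an exception, and the sketch assumes that a new DataNode without a namespace ID still has its software version checked.

```python
from dataclasses import dataclass, field


class HandshakeError(Exception):
    """Signals that the DataNode must shut down instead of joining."""


@dataclass(frozen=True)
class BlockReplica:
    block_id: int
    length: int
    generation_stamp: int


@dataclass
class NameNode:
    namespace_id: int
    software_version: str
    # block_id -> set of storage IDs that hold a replica of that block
    block_locations: dict = field(default_factory=dict)

    def handshake(self, namespace_id, software_version):
        """Verify a DataNode and return the namespace ID it should store."""
        # Assumption: the version check also applies to brand-new DataNodes.
        if software_version != self.software_version:
            raise HandshakeError("software version mismatch")
        if namespace_id is None:
            # A new DataNode has no namespace ID and adopts the cluster's.
            return self.namespace_id
        if namespace_id != self.namespace_id:
            raise HandshakeError("namespace ID mismatch")
        return self.namespace_id

    def receive_block_report(self, storage_id, replicas):
        """Replace what is known about one DataNode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(storage_id)
        for replica in replicas:
            self.block_locations.setdefault(replica.block_id, set()).add(storage_id)
```

In this model a DataNode calls `handshake` first and sends its first block report only if no error is raised. Each later report fully replaces the previous one for that storage ID, so replicas that have disappeared from the DataNode drop out of the NameNode's view.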
Heartbeats
The connectivity between the NameNode and a DataNode is managed by the persistent heartbeats
that the DataNode sends every three seconds. Each heartbeat confirms to the NameNode that the
DataNode and the block replicas it holds are available. Heartbeats
also carry information about total storage capacity, storage in use, and the number of data transfers
currently in progress. These statistics are used by the NameNode for managing space allocation and load
balancing.
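A minimal sketch of how a NameNode might use heartbeats follows. The three-second interval comes from the text; everything else is an assumption made for illustration, including the `HeartbeatMonitor` class, the 600-second timeout after which a silent DataNode is treated as unavailable, and the placement rule of preferring the node with the most free space. Timestamps are passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL = 3.0  # seconds between heartbeats, as stated in the text
DEAD_AFTER = 600.0        # seconds of silence before a node is ignored (assumed)


@dataclass
class Heartbeat:
    storage_id: str
    capacity: int          # total storage capacity in bytes
    used: int              # storage in use in bytes
    active_transfers: int  # data transfers currently in progress


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping for DataNode heartbeats."""

    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}  # storage_id -> time of last heartbeat
        self.stats = {}      # storage_id -> most recent Heartbeat

    def receive(self, heartbeat, now):
        self.last_seen[heartbeat.storage_id] = now
        self.stats[heartbeat.storage_id] = heartbeat

    def live_nodes(self, now):
        """Storage IDs that sent a heartbeat within the timeout window."""
        return [sid for sid, seen in self.last_seen.items()
                if now - seen <= self.dead_after]

    def pick_target(self, now):
        """Pick the live node with the most free space, then fewest transfers."""
        live = self.live_nodes(now)
        if not live:
            return None

        def score(sid):
            hb = self.stats[sid]
            return (hb.capacity - hb.used, -hb.active_transfers)

        return max(live, key=score)
```

The same statistics could feed other policies, such as skipping nodes above a transfer threshold. The point of the sketch is only that availability and load information arrive on the same periodic message.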