Introducing Big Data Technologies - Data Warehousing in the Age of Big Data

Databases Reference

In-Depth Information

each file. To maintain file and block integrity, once a block is assigned to a DataNode, two files are

created to represent each replica in the local host's native file system. The first file contains the data

itself and the second file is the block's metadata including checksums for each data block and genera-

tion stamp.

Replication and recovery

In the original design of HDFS there was a single NameNode for each cluster, which became the sin-

gle point of failure. This has been addressed in the recent releases of HDFS where NameNode repli-

cation is now a standard feature like DataNode replication.

Communication and management

The most critical component within the HDFS architecture is the communication and management

between a NameNode and DataNodes. This aspect is implemented as a protocol of handshakes and

system IDs. Upon initial creation and formatting, a namespace ID is assigned to the file system on

the NameNode. This ID is persistently stored on all the nodes across the cluster. DataNodes similarly

are assigned a unique storage ID on the initial creation and registration with a NameNode. This stor-

age ID never changes and will be persistent event if the DataNode is started on a different IP address

or port.

During the startup process, the NameNode completes its namespace refresh and is ready to

establish the communication with the DataNode. To ensure that each DataNode that connects to the

NameNode is the correct DataNode, there is a series of verification steps:

●

The DataNode identifies itself to the NameNode with a handshake and verifies its namespace ID

and software version.

●

If either does not match with the NameNode, the DataNode automatically shuts down.

●

The signature verification process prevents incorrect nodes from joining the cluster and

automatically preserves the integrity of the file system.

●

The signature verification process also is an assurance check for consistency of software versions

between the NameNode and DataNode, since an incompatible version can cause data corruption

or loss.

●

After the handshake and validation on the NameNode, a DataNode sends a block report. A block

report contains the block ID, the length for each block replica, and the generation stamp.

●

The first block report is sent immediately upon the DataNode registration.

●

Subsequently, hourly updates of the block report are sent to the NameNode, which provide the

view of where block replicas are located on the cluster.

●

When a new DataNode is added and initialized, since it does not have a namespace ID, it is

permitted to join the cluster and receive the cluster's namespace ID.

Heartbeats

The connectivity between the NameNode and a DataNode are managed by the persistemt heartbeats

that are sent by the DataNode every three seconds. The heartbeat provides the NameNode confirma-

tion about the availability of the blocks and the replicas of the DataNode. Additionally, heartbeats

also carry information about total storage capacity, storage in use, and the number of data transfers

currently in progress. These statistics are by the NameNode for managing space allocation and load

balancing.

Data Warehousing in the Age of Big Data

Search WWH ::

Custom Search

Home