Storing and Managing Data in HDFS - Microsoft Big Data Solutions

to which DataNodes. These blocks of data are usually 64MB, but the setting

is configurable.
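
As a rough illustration of the block size mentioned above, the following sketch computes how a file would be divided into blocks. The function name `split_into_blocks` is hypothetical and is not part of any Hadoop API. The 64 MB default comes from the text, and the sketch assumes the final block holds only the bytes that remain.

```python
# Illustrative arithmetic only; this does not call HDFS.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the usual size cited in the text (configurable)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the list of block sizes (in bytes) for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # Assumption: the last block holds only the leftover bytes.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    sizes = split_into_blocks(200 * mb)
    print(len(sizes), [s // mb for s in sizes])
```

Under these assumptions, a 200 MB file occupies three full 64 MB blocks plus one 8 MB block.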

The DataNode is responsible for the creation of blocks of data in its physical

storage and for the deletion of those blocks. It is also responsible for creation

of replica blocks from other nodes. The NameNode coordinates this activity,

telling the DataNode what blocks to create, delete, or replicate. DataNodes

communicate with the NameNode by sending a regular “heartbeat”

communication over the network. This heartbeat indicates that the

DataNode is operating correctly. A block report is also delivered with the

heartbeat and provides a list of all the blocks stored on the DataNode.
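
The heartbeat and block report exchange described above can be pictured with a small in-memory model. This is a minimal sketch, not Hadoop code: the class `NameNodeView`, its method names, and the 30-second timeout are all assumptions made for illustration.

```python
class NameNodeView:
    """Toy model of what a NameNode learns from heartbeats and block reports."""

    def __init__(self, timeout_seconds=30.0):
        # Hypothetical timeout: a node silent for longer is treated as not live.
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.blocks_by_node = {}   # DataNode id -> set of block ids it reported

    def receive_heartbeat(self, node_id, now, block_report=None):
        """Record a heartbeat; replace the node's block list if a report is attached."""
        self.last_heartbeat[node_id] = now
        if block_report is not None:
            self.blocks_by_node[node_id] = set(block_report)

    def live_nodes(self, now):
        """Return the DataNodes whose last heartbeat is within the timeout."""
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_seconds
        )

    def locations(self, block_id, now):
        """Return the live DataNodes that reported holding block_id."""
        live = set(self.live_nodes(now))
        return sorted(
            node for node, blocks in self.blocks_by_node.items()
            if block_id in blocks and node in live
        )


if __name__ == "__main__":
    nn = NameNodeView(timeout_seconds=30)
    nn.receive_heartbeat("dn1", now=0, block_report=["b1", "b2"])
    nn.receive_heartbeat("dn2", now=0, block_report=["b2"])
    nn.receive_heartbeat("dn1", now=25)   # dn2 stays silent
    print(nn.live_nodes(now=40))
    print(nn.locations("b2", now=40))
```

In this model, a DataNode that stops sending heartbeats drops out of the live set, so its blocks are no longer offered as locations even though its last block report is still on record.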

The NameNode maintains a transaction history of all changes to the file

system, known as the EditLog. It also maintains a file, referred to as the

FsImage, that contains the file system metadata. The FsImage and EditLog

files are read by the NameNode when it starts up, and the EditLog's

transaction history is applied to the FsImage. This brings the FsImage

up-to-date with the latest changes recorded by the NameNode. Once the

FsImage is updated, it is written back to the file system, and the EditLog

is cleared. At this point, the NameNode can begin accepting requests. This

process (shown in Figure 5.1) is referred to as checkpointing, and it is run

only on startup. It can have some performance impact if the NameNode has

accumulated a large EditLog.

Figure 5.1 The checkpointing process
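
The checkpointing steps described above (load the FsImage, replay the EditLog against it, write the result back, clear the log) can be sketched in a few lines. This is a simplified model under stated assumptions: the FsImage is represented as a dictionary from path to block list, and the EditLog as a list of `(operation, path, value)` tuples. Neither matches the real on-disk formats, and the function name `checkpoint` is hypothetical.

```python
def checkpoint(fsimage, editlog):
    """Replay editlog onto a copy of fsimage.

    Returns (updated_fsimage, cleared_editlog). The inputs are not modified.
    """
    image = dict(fsimage)  # work on a copy so the old image survives a failure
    for op, path, value in editlog:
        if op == "create":
            image[path] = value              # value: list of block ids
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[value] = image.pop(path)   # value: the new path
        else:
            raise ValueError(f"unknown EditLog operation: {op!r}")
    # Once the updated image is persisted, the log can start again empty.
    return image, []


if __name__ == "__main__":
    fsimage = {"/data/a.csv": ["blk_1"]}
    editlog = [
        ("create", "/data/b.csv", ["blk_2", "blk_3"]),
        ("rename", "/data/a.csv", "/data/archive/a.csv"),
        ("delete", "/data/b.csv", None),
    ]
    fsimage, editlog = checkpoint(fsimage, editlog)
    print(fsimage)
    print(editlog)
```

Because every logged operation is replayed in order, the cost of this step grows with the length of the EditLog, which is consistent with the performance impact noted in the text.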

The NameNode is a crucial component of any HDFS cluster. Without a

functioning NameNode, the data cannot be accessed. That means that the

NameNode is a single point of failure for the cluster. Because of that, the

NameNode is one place that using a more fault-tolerant hardware setup is
