Data management
HDFS uses the master-slave mechanism to distribute data across multiple servers. The
master node is usually backed by a powerful machine so that it is less likely to fail. The
slave machines are the data nodes, which run on commodity hardware. The reason behind
having a powerful master node is that we do not want it to go down, as it is a single point
of failure. If the master node (that is, NameNode) goes down, the storage is down, unlike
the Cassandra model. To load data into HDFS, the client connects to the master node and
sends an upload request. The master node tells the client to send parts of the data to
various data nodes. Note that data does not stream through the master node. It just directs
the client to the appropriate data nodes and maintains the metadata about the location of
the various parts of a file. The following diagram shows how the client makes a request to
NameNode to write a block. NameNode returns the list of nodes where the block is to be
written. The client picks one DataNode from the list returned in the previous step and
writes the block to it; that DataNode then forwards the data to the other nodes:
[Diagram: the client requests a block write from NameNode, receives a list of DataNodes, and streams the block through the DataNode pipeline]
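To make the write path concrete, here is a minimal sketch using the standard Hadoop Java client (org.apache.hadoop.fs.FileSystem). The NameNode URI (hdfs://namenode:8020) and the file path are placeholder assumptions for illustration; the client library performs the NameNode lookup and the DataNode pipeline described above behind the scenes.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect to the master (NameNode). It serves only metadata;
            // the file contents never stream through it.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // create() asks NameNode which DataNodes should hold the blocks;
            // the returned stream then writes directly to those DataNodes,
            // and the first DataNode forwards the data down the pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/example.txt"))) {
                out.writeUTF("hello HDFS");
            }
            fs.close();
        }
    }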
There are two processes one needs to know about to understand how data is distributed and managed by HDFS.
NameNode
The NameNode process runs on the master server. Its job is to keep the metadata about the
files that are stored in the data nodes. If NameNode is down, the slaves have no way to
make sense of the blocks they store. Therefore, it is crucial to run NameNode on
redundant hardware. In general, a Hadoop cluster has just one master NameNode.
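As an illustration of the kind of metadata NameNode serves, the following sketch asks for the block locations of a file through the same FileSystem API, assuming the placeholder URI and path from the earlier example. The answer comes entirely from NameNode; no DataNode is contacted.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));
            // The block-to-DataNode mapping is answered from NameNode's metadata.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " -> hosts " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }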
DataNodes
DataNodes are the slaves. They are the machines that actually contain the data. The
DataNode process manages the data blocks on the local machine. DataNode keeps
communicating with NameNode, sending periodic heartbeats and block reports so that
NameNode knows which blocks each node holds and whether the node is still alive.
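To see what NameNode learns from those heartbeats, a short sketch using DistributedFileSystem.getDataNodeStats() from the Hadoop HDFS client API can list the known DataNodes and their capacity; the cluster URI is again a placeholder assumption.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DataNodeReportSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            // getDataNodeStats() reports what NameNode has learned from the
            // DataNodes' heartbeats: liveness, capacity, and usage.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName() + ": " + node.getDfsUsed()
                        + " of " + node.getCapacity() + " bytes used");
            }
            fs.close();
        }
    }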