Data management
HDFS uses the master-slave mechanism to distribute data across multiple servers. The
master node is usually backed by a powerful machine so that it is less likely to fail. The
slave machines are the data nodes, which run on commodity hardware. The reason behind
having a powerful master node is that we do not want it to go down, as it is a single point
of failure. If the master node (that is, NameNode) goes down, the storage is down, unlike
the Cassandra model. To load data into HDFS, the client connects to the master node and
sends an upload request. The master node tells the client to send parts of the data to
various data nodes. Note that data does not stream through the master node. It just directs
the client to the appropriate data nodes and maintains the metadata about the location of
the various parts of a file. The following diagram shows how the client makes a request to
NameNode to write a block. NameNode returns the list of nodes where the block is to be
written. The client picks one DataNode from the list returned in the previous step and
writes the block to it; that DataNode then forwards the data to the other nodes:
[Diagram: the client requests a block write from NameNode, receives a list of DataNodes, and streams the block through the DataNode pipeline]
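To make the write path concrete, here is a minimal sketch using the standard Hadoop Java client (org.apache.hadoop.fs.FileSystem). The NameNode URI (hdfs://namenode:8020) and the file path are placeholder assumptions for illustration; the client library performs the NameNode lookup and the DataNode pipeline described above behind the scenes.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect to the master (NameNode). It serves only metadata;
            // the file contents never stream through it.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // create() asks NameNode which DataNodes should hold the blocks;
            // the returned stream then writes directly to those DataNodes,
            // and the first DataNode forwards the data down the pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/example.txt"))) {
                out.writeUTF("hello HDFS");
            }
            fs.close();
        }
    }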
There are two processes one needs to know about to understand how data is distributed and managed by HDFS.
NameNode
The NameNode process runs on the master server. Its job is to keep the metadata about the
files that are stored in the data nodes. If NameNode is down, the slaves have no way to
make sense of the blocks they store. Therefore, it is crucial to run NameNode on
redundant hardware. In general, a Hadoop cluster has just one master NameNode.
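As an illustration of the kind of metadata NameNode serves, the following sketch asks for the block locations of a file through the same FileSystem API, assuming the placeholder URI and path from the earlier example. The answer comes entirely from NameNode; no DataNode is contacted.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));
            // The block-to-DataNode mapping is answered from NameNode's metadata.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " -> hosts " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }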
DataNodes
DataNodes are the slaves. They are the machines that actually contain the data. The
DataNode process manages the data blocks on the local machine. DataNode keeps
communicating with NameNode, sending periodic heartbeats and block reports so that
NameNode knows which blocks each node holds and whether the node is still alive.
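To see what NameNode learns from those heartbeats, a short sketch using DistributedFileSystem.getDataNodeStats() from the Hadoop HDFS client API can list the known DataNodes and their capacity; the cluster URI is again a placeholder assumption.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DataNodeReportSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            // getDataNodeStats() reports what NameNode has learned from the
            // DataNodes' heartbeats: liveness, capacity, and usage.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName() + ": " + node.getDfsUsed()
                        + " of " + node.getCapacity() + " bytes used");
            }
            fs.close();
        }
    }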