Storage Provisioning and Networking - Deploying and Managing a Cloud Infrastructure


kept in a manner that distributes load evenly across the cluster. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. The NameNode also determines the mapping of blocks to DataNodes. DataNodes perform block creation, deletion, and replication as directed by the NameNode, and they serve read and write requests from clients.
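
The division of labor described above can be illustrated with a small model. The following is a minimal sketch, not the actual HDFS implementation: the class name NameNodeSketch, its methods, and the DataNode names are hypothetical and exist only to show that namespace operations such as rename touch metadata alone, while the block-to-DataNode mapping is kept separately.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of NameNode metadata (not the real HDFS classes).
class NameNodeSketch {
    // Namespace: file path -> ordered list of block IDs that make up the file.
    private final Map<String, List<Long>> namespace = new HashMap<>();
    // Block map: block ID -> names of the DataNodes that hold a replica.
    private final Map<Long, List<String>> blockLocations = new HashMap<>();
    private long nextBlockId = 1;

    void createFile(String path) {
        if (namespace.putIfAbsent(path, new ArrayList<>()) != null) {
            throw new IllegalStateException("File already exists: " + path);
        }
    }

    // Allocates a new block for the file and records which DataNodes store it.
    long addBlock(String path, List<String> dataNodes) {
        List<Long> blocks = namespace.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        long id = nextBlockId++;
        blocks.add(id);
        blockLocations.put(id, new ArrayList<>(dataNodes));
        return id;
    }

    // Rename is a namespace-only operation: no block moves between DataNodes.
    void rename(String from, String to) {
        if (namespace.containsKey(to)) {
            throw new IllegalStateException("Target already exists: " + to);
        }
        List<Long> blocks = namespace.remove(from);
        if (blocks == null) {
            throw new IllegalArgumentException("No such file: " + from);
        }
        namespace.put(to, blocks);
    }

    // Returns the DataNodes a client would contact to read the given block of a file.
    List<String> locate(String path, int blockIndex) {
        List<Long> blocks = namespace.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        return blockLocations.get(blocks.get(blockIndex));
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch();
        nn.createFile("/logs/a.txt");
        nn.addBlock("/logs/a.txt", List.of("dn1", "dn2", "dn3"));
        nn.rename("/logs/a.txt", "/logs/b.txt");
        System.out.println(nn.locate("/logs/b.txt", 0));
    }
}
```

In this sketch the client asks the metadata holder where a block lives and would then read the data from a DataNode directly, which mirrors the roles described in the text.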

HDFS is built using the Java language, so any machine that supports Java can run the NameNode and DataNode software. Because Java is highly portable, HDFS can be deployed on a wide range of machines, typically commodity hardware running the GNU/Linux operating system (OS).


HDFS exposes a traditional hierarchical file system namespace. Directories and files can be created, removed, renamed, and moved from one path to another. Early HDFS releases did not support user quotas or access permissions; current releases support both, but hard and soft links are still not supported, although the HDFS architecture does not preclude implementing them. The NameNode maintains the file system namespace and records any change to the namespace or its properties. Users can specify the replication factor, which determines the number of replicas kept for a file, in a configuration file on the NameNode.


Architecture

Figure 9.5 shows an architectural overview of HDFS, including the NameNode and DataNodes. Clients execute read and write operations, and the NameNode is responsible for maintaining the replicas, including deciding when and where to place new ones. The NameNode also receives a heartbeat and a block report from each DataNode in the cluster. The heartbeat tells the NameNode that the DataNode is still alive and functioning properly; a DataNode whose heartbeats stop arriving is marked as dead and is sent no further I/O requests. The block report lists all blocks stored on a DataNode, so it makes the NameNode aware of which replicas are located on which DataNodes and helps it decide where to put new replicas.
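
The way heartbeats and block reports feed the NameNode's replication decisions can be sketched as follows. This is an illustrative model under stated assumptions, not HDFS code: the class DataNodeMonitorSketch, its method names, and the timeout value are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of how a NameNode might track DataNode liveness and replicas.
class DataNodeMonitorSketch {
    // Assumed rule: a DataNode is treated as dead once no heartbeat arrives within this window.
    private final long timeoutMillis;
    // DataNode name -> time of its most recent heartbeat.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    // DataNode name -> block IDs listed in its most recent block report.
    private final Map<String, Set<Long>> reportedBlocks = new HashMap<>();

    DataNodeMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onHeartbeat(String dataNode, long nowMillis) {
        lastHeartbeat.put(dataNode, nowMillis);
    }

    // A block report replaces whatever was previously known about that DataNode.
    void onBlockReport(String dataNode, Set<Long> blockIds) {
        reportedBlocks.put(dataNode, new HashSet<>(blockIds));
    }

    boolean isAlive(String dataNode, long nowMillis) {
        Long last = lastHeartbeat.get(dataNode);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    // Counts replicas of a block that sit on DataNodes currently considered alive.
    int liveReplicas(long blockId, long nowMillis) {
        int count = 0;
        for (Map.Entry<String, Set<Long>> entry : reportedBlocks.entrySet()) {
            if (isAlive(entry.getKey(), nowMillis) && entry.getValue().contains(blockId)) {
                count++;
            }
        }
        return count;
    }

    // Blocks with fewer live replicas than the target; candidates for re-replication.
    Set<Long> underReplicated(int replicationFactor, long nowMillis) {
        Set<Long> result = new TreeSet<>();
        for (Set<Long> blocks : reportedBlocks.values()) {
            for (long blockId : blocks) {
                if (liveReplicas(blockId, nowMillis) < replicationFactor) {
                    result.add(blockId);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DataNodeMonitorSketch monitor = new DataNodeMonitorSketch(10_000);
        monitor.onHeartbeat("dn1", 0);
        monitor.onHeartbeat("dn2", 0);
        monitor.onBlockReport("dn1", Set.of(1L, 2L));
        monitor.onBlockReport("dn2", Set.of(1L));
        System.out.println(monitor.underReplicated(2, 5_000));
    }
}
```

The sketch shows why both messages are needed: the heartbeat alone says a node is reachable, while the block report says which replicas that node contributes, and only the combination reveals which blocks have too few live copies.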

Data Replication

HDFS reliably stores each file as a sequence of blocks, and each block is replicated across DataNodes in the cluster according to the replication factor. All blocks in a file are the same size except the last block. Replication is meant to provide fault tolerance and recoverability from disaster. The block size and the replication factor (dfs.replication) are set in a configuration file on the NameNode (hdfs-site.xml). The replication factor can also be specified on a per-file basis and can be changed at any time. However, the effective replication factor cannot exceed the number of DataNodes, because a DataNode stores at most one replica of a given block. Files in HDFS are write-once and have strictly one writer process at any given time. This, as mentioned previously, avoids the tedious tasks of serialization/deserialization, file locking mechanisms, and repeated verification of a continuously growing file.
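
The block layout described above comes down to simple arithmetic. The sketch below is illustrative only: the class BlockLayoutSketch and its methods are hypothetical, and the 128 MB block size and replication factor of 3 used in the example are assumed values, not figures taken from the text.

```java
// Hypothetical helper that works out how a file is split into blocks and replicated.
class BlockLayoutSketch {
    private static void check(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
    }

    // Number of blocks: full blocks plus one partial block if there is a remainder.
    static long blockCount(long fileSize, long blockSize) {
        check(fileSize, blockSize);
        return fileSize / blockSize + (fileSize % blockSize == 0 ? 0 : 1);
    }

    // Every block is blockSize bytes except possibly the last one.
    static long lastBlockSize(long fileSize, long blockSize) {
        check(fileSize, blockSize);
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    // Raw cluster storage consumed once every block has been replicated.
    static long rawStorage(long fileSize, int replicationFactor) {
        if (fileSize < 0 || replicationFactor < 1) {
            throw new IllegalArgumentException("fileSize must be >= 0 and replicationFactor >= 1");
        }
        return fileSize * replicationFactor;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 300 * mb;   // example file
        long blockSize = 128 * mb;  // assumed block size
        int replication = 3;        // assumed replication factor
        System.out.println("blocks: " + blockCount(fileSize, blockSize));
        System.out.println("last block (MB): " + lastBlockSize(fileSize, blockSize) / mb);
        System.out.println("raw storage (MB): " + rawStorage(fileSize, replication) / mb);
    }
}
```

With these assumed values a 300 MB file occupies three blocks (128 MB, 128 MB, and a final 44 MB block) and consumes 900 MB of raw storage across the cluster.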

Replica placement is critical to HDFS reliability and performance, and optimizing it is a feature that distinguishes HDFS from most other distributed file systems. The purpose of a rack-aware replica placement policy is to improve data reliability and availability and to make better use of network bandwidth.
