Scalability —Linear scalability at the storage layer is required to exploit parallel processing fully; the design goal is 100% linear scalability.
Fault tolerance —The ability to recover from failures automatically and complete the processing of data.
Cross-platform compatibility —The ability to integrate across multiple
architecture platforms.
Compute and storage in one environment —Data and computation colocated in the same architecture, eliminating redundant I/O and excessive disk access.
HDFS is analogous to GFS in the Google MapReduce implementation. A block in HDFS is equivalent to a chunk in GFS and is also very large: 64 MB by default, although 128 MB is used in some installations. The large block size is intended to reduce the number of seeks and improve data transfer times. Each block is an independent unit stored as a dynamically allocated file in the Linux local file system in a DataNode directory. If the node has multiple disk drives, multiple DataNode directories can be specified. An additional local file per block stores metadata for the block. HDFS also follows a master-slave architecture: a single master server, the NameNode, manages the distributed file system namespace and regulates access to files by clients. In addition, there are multiple DataNodes, one per node in the cluster, which manage the disk storage attached to the nodes and assigned to Hadoop. The NameNode determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from file system clients such as MapReduce tasks, and they also perform block creation, deletion, and replication based on commands from the NameNode.
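The effect of the large block size on seek overhead can be sketched with a simple calculation (a minimal illustration; the 64 MB and 128 MB figures come from the text above, the 1 GB file size is an assumed example):

```python
import math

MB = 1024 * 1024  # bytes per megabyte

def num_blocks(file_size_bytes, block_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

file_size = 1024 * MB  # an assumed 1 GB example file

print(num_blocks(file_size, 64 * MB))   # 16 blocks at the 64 MB default
print(num_blocks(file_size, 128 * MB))  # 8 blocks at 128 MB: half the seeks
```

Doubling the block size halves the number of blocks (and thus seeks and per-block metadata) for large files, which is why some installations raise it to 128 MB.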
HDFS is a file system, and like any other file system architecture, it needs to manage consistency, recoverability, and concurrency for reliable operations. These requirements have been addressed in the architecture by creating image, journal, and checkpoint files.
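The division of labor among these three files can be sketched as a toy namespace (illustrative only; class and method names are invented here, and the real HDFS fsimage/edit-log formats are far richer): the image is a snapshot, the journal logs edits since that snapshot, and a checkpoint merges the two so recovery replays fewer edits.

```python
class ToyNamespace:
    """Toy model of the image/journal/checkpoint recovery idea."""

    def __init__(self):
        self.image = {}    # last checkpointed snapshot of the namespace
        self.journal = []  # edits recorded since the last checkpoint

    def apply(self, op, path):
        # Journal the edit first, so it survives a crash (recoverability).
        self.journal.append((op, path))

    def checkpoint(self):
        # Merge outstanding journal entries into the image, then truncate.
        for op, path in self.journal:
            if op == "create":
                self.image[path] = True
            elif op == "delete":
                self.image.pop(path, None)
        self.journal = []

    def recover(self):
        # Recovery = load the image, then replay the journal on top of it.
        state = dict(self.image)
        for op, path in self.journal:
            if op == "create":
                state[path] = True
            elif op == "delete":
                state.pop(path, None)
        return state

ns = ToyNamespace()
ns.apply("create", "/data/a")
ns.checkpoint()            # "/data/a" is now in the image
ns.apply("delete", "/data/a")
print(ns.recover())        # journal replay removes "/data/a" again
```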
17.3.1.1 HDFS Architecture
1. NameNode (master node) : The NameNode is a single master server that manages the file system namespace and regulates access to files by clients. Additionally, the NameNode manages operations such as opening, closing, moving, naming, and renaming of files and directories. It also manages the mapping of blocks to DataNodes.
2. DataNodes (slave nodes) : DataNodes represent the slaves in the architecture; they manage the data and the disk storage attached to the nodes.
A typical HDFS cluster can have thousands of DataNodes and tens
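The NameNode's role as the keeper of the block-to-DataNode mapping can be sketched as follows (a minimal, illustrative model with invented names; the real NameNode also tracks heartbeats, rack topology, and replica health):

```python
class ToyNameNode:
    """Toy model of the NameNode's block-to-DataNode mapping."""

    def __init__(self, replication=3):
        self.replication = replication
        self.block_map = {}  # block_id -> list of DataNodes holding a replica

    def assign_block(self, block_id, datanodes):
        # Choose the first `replication` DataNodes for the new block
        # (real HDFS placement is rack-aware, not first-come).
        replicas = list(datanodes)[: self.replication]
        self.block_map[block_id] = replicas
        return replicas

    def locate(self, block_id):
        # A reading client asks the NameNode where a block's replicas live,
        # then contacts those DataNodes directly for the data.
        return self.block_map.get(block_id, [])

nn = ToyNameNode(replication=2)
nn.assign_block("blk_1", ["dn1", "dn2", "dn3"])
print(nn.locate("blk_1"))  # the two DataNodes chosen for blk_1
```

Note that the NameNode never serves file data itself; it only hands out locations, and clients read from and write to DataNodes directly.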