Scalability —Linear scalability at the storage layer is required to exploit parallel processing fully; the design goal is 100% linear scalability.
Fault tolerance —The ability to recover from failures automatically and complete the processing of data.
Cross-platform compatibility —The ability to integrate across multiple
architecture platforms.
Compute and storage in one environment —Data and computation colocated in the same architecture, eliminating redundant I/O and excessive disk access.
HDFS is analogous to GFS in the Google MapReduce implementation. A block in HDFS is equivalent to a chunk in GFS and is also very large: 64 MB by default, although 128 MB is used in some installations. The large block size is intended to reduce the number of seeks and improve data transfer times. Each block is an independent unit stored as a dynamically allocated file in the Linux local file system in a DataNode directory. If the node has multiple disk drives, multiple DataNode directories can be specified. An additional local file per block stores metadata for the block. HDFS also follows a master-slave architecture: a single master server, the NameNode, manages the distributed file system namespace and regulates access to files by clients. In addition, there are multiple DataNodes, one per node in the cluster, which manage the disk storage attached to the nodes and assigned to Hadoop. The NameNode determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from file system clients such as MapReduce tasks, and they also perform block creation, deletion, and replication based on commands from the NameNode.
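The effect of the large block size on seek overhead can be sketched with a simple calculation (a minimal illustration; the 64 MB and 128 MB figures come from the text above, the 1 GB file size is an assumed example):

```python
import math

MB = 1024 * 1024  # bytes per megabyte

def num_blocks(file_size_bytes, block_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

file_size = 1024 * MB  # an assumed 1 GB example file

print(num_blocks(file_size, 64 * MB))   # 16 blocks at the 64 MB default
print(num_blocks(file_size, 128 * MB))  # 8 blocks at 128 MB: half the seeks
```

Doubling the block size halves the number of blocks (and thus seeks and per-block metadata) for large files, which is why some installations raise it to 128 MB.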
HDFS is a file system, and like any other file system architecture, it needs to manage consistency, recoverability, and concurrency for reliable operations. These requirements have been addressed in the architecture by creating image, journal, and checkpoint files.
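The division of labor among these three files can be sketched as a toy namespace (illustrative only; class and method names are invented here, and the real HDFS fsimage/edit-log formats are far richer): the image is a snapshot, the journal logs edits since that snapshot, and a checkpoint merges the two so recovery replays fewer edits.

```python
class ToyNamespace:
    """Toy model of the image/journal/checkpoint recovery idea."""

    def __init__(self):
        self.image = {}    # last checkpointed snapshot of the namespace
        self.journal = []  # edits recorded since the last checkpoint

    def apply(self, op, path):
        # Journal the edit first, so it survives a crash (recoverability).
        self.journal.append((op, path))

    def checkpoint(self):
        # Merge outstanding journal entries into the image, then truncate.
        for op, path in self.journal:
            if op == "create":
                self.image[path] = True
            elif op == "delete":
                self.image.pop(path, None)
        self.journal = []

    def recover(self):
        # Recovery = load the image, then replay the journal on top of it.
        state = dict(self.image)
        for op, path in self.journal:
            if op == "create":
                state[path] = True
            elif op == "delete":
                state.pop(path, None)
        return state

ns = ToyNamespace()
ns.apply("create", "/data/a")
ns.checkpoint()            # "/data/a" is now in the image
ns.apply("delete", "/data/a")
print(ns.recover())        # journal replay removes "/data/a" again
```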
17.3.1.1 HDFS Architecture
1. NameNode (master node) : The NameNode is a single master server that manages the file system namespace and regulates access to files by clients. Additionally, the NameNode manages operations such as opening, closing, moving, naming, and renaming of files and directories. It also manages the mapping of blocks to DataNodes.
2. DataNodes (slave nodes) : DataNodes represent the slaves in the architecture; they manage the data and the disk storage attached to the nodes.
A typical HDFS cluster can have thousands of DataNodes and tens
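The NameNode's role as the keeper of the block-to-DataNode mapping can be sketched as follows (a minimal, illustrative model with invented names; the real NameNode also tracks heartbeats, rack topology, and replica health):

```python
class ToyNameNode:
    """Toy model of the NameNode's block-to-DataNode mapping."""

    def __init__(self, replication=3):
        self.replication = replication
        self.block_map = {}  # block_id -> list of DataNodes holding a replica

    def assign_block(self, block_id, datanodes):
        # Choose the first `replication` DataNodes for the new block
        # (real HDFS placement is rack-aware, not first-come).
        replicas = list(datanodes)[: self.replication]
        self.block_map[block_id] = replicas
        return replicas

    def locate(self, block_id):
        # A reading client asks the NameNode where a block's replicas live,
        # then contacts those DataNodes directly for the data.
        return self.block_map.get(block_id, [])

nn = ToyNameNode(replication=2)
nn.assign_block("blk_1", ["dn1", "dn2", "dn3"])
print(nn.locate("blk_1"))  # the two DataNodes chosen for blk_1
```

Note that the NameNode never serves file data itself; it only hands out locations, and clients read from and write to DataNodes directly.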