Whenever possible, HDFS attempts to store the blocks for a file on different
machines so the map step can operate on each block of a file in parallel. Also, by
default, HDFS creates three copies of each block across the cluster to provide the
necessary redundancy in case of a failure. If a machine fails, HDFS replicates an
accessible copy of the relevant data blocks to another available machine. HDFS
is also rack aware, which means that it distributes the blocks across several
equipment racks to prevent an entire rack failure from causing a data unavailability
event. Additionally, the three copies of each block allow Hadoop some flexibility
in determining which machine to use for the map step on a particular block.
For example, an idle or underutilized machine that contains a data block to be
processed can be scheduled to process that data block.
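As a rough illustration, the per-file replication factor can be inspected and adjusted through the Hadoop Java API. The sketch below is only a minimal example; it assumes a reachable HDFS cluster whose configuration files are on the classpath, and the path /user/demo/sample.txt is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file assumed to already exist in HDFS.
        Path file = new Path("/user/demo/sample.txt");

        // Report the current replication factor for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Request a different replication factor; HDFS re-replicates blocks in the background.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}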
To manage the data access, HDFS utilizes three Java daemons (background
processes): NameNode, DataNode, and Secondary NameNode. Running on a
single machine, the NameNode daemon determines and tracks where the various
blocks of a data file are stored. The DataNode daemon manages the data stored
on each machine. If a client application wants to access a particular file stored in
HDFS, the application contacts the NameNode, and the NameNode provides the
application with the locations of the various blocks for that file. The application
then communicates with the appropriate DataNodes to access the file.
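A minimal sketch of that interaction, under the same assumptions and hypothetical file path as above: the client obtains block locations through metadata calls answered by the NameNode, and the Hadoop client library then handles the connections to the appropriate DataNodes when the file is read.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Metadata lookup: the NameNode reports which DataNodes hold each block.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }

        // Data access: the client library streams the blocks from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println("First line: " + reader.readLine());
        }

        fs.close();
    }
}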
Each DataNode periodically builds a report about the blocks stored on the
DataNode and sends the report to the NameNode. If one or more blocks are not
accessible on a DataNode, the NameNode ensures that an accessible copy of an
inaccessible data block is replicated to another machine. For performance reasons,
the NameNode keeps its file system metadata in memory. Because the NameNode is critical
to the operation of HDFS, any unavailability or corruption of the NameNode
results in a data unavailability event on the cluster. Thus, the NameNode is viewed
as a single point of failure in the Hadoop environment [15]. To minimize the chance
of a NameNode failure and to improve performance, the NameNode is typically run
on a dedicated machine.
A third daemon, the Secondary NameNode, provides the capability to perform
some of the NameNode tasks to reduce the load on the NameNode. Such tasks
include updating the file system image with the contents of the file system edit logs.
It is important to note that the Secondary NameNode is not a backup or redundant
NameNode. In the event of a NameNode outage, the NameNode must be restarted
and initialized with the last file system image file and the contents of the edit logs.
The latest versions of Hadoop provide an HDFS High Availability (HA) feature.
This feature enables the use of two NameNodes: one in an active state and the other in a standby state.