Whenever possible, HDFS attempts to store the blocks for a file on different
machines so the map step can operate on each block of a file in parallel. Also, by
default, HDFS creates three copies of each block across the cluster to provide the
necessary redundancy in case of a failure. If a machine fails, HDFS replicates an
accessible copy of the relevant data blocks to another available machine. HDFS
is also rack aware, which means that it distributes the blocks across several
equipment racks to prevent an entire rack failure from causing a data unavailability
event. Additionally, the three copies of each block allow Hadoop some flexibility
in determining which machine to use for the map step on a particular block.
For example, an idle or underutilized machine that contains a data block to be
processed can be scheduled to process that data block.
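As a rough illustration, the per-file replication factor can be inspected and adjusted through the Hadoop Java API. The sketch below is only a minimal example; it assumes a reachable HDFS cluster whose configuration files are on the classpath, and the path /user/demo/sample.txt is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file assumed to already exist in HDFS.
        Path file = new Path("/user/demo/sample.txt");

        // Report the current replication factor for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Request a different replication factor; HDFS re-replicates blocks in the background.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}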
To manage the data access, HDFS utilizes three Java daemons (background
processes): NameNode, DataNode, and Secondary NameNode. Running on a
single machine, the NameNode daemon determines and tracks where the various
blocks of a data file are stored. The DataNode daemon manages the data stored
on each machine. If a client application wants to access a particular file stored in
HDFS, the application contacts the NameNode, and the NameNode provides the
application with the locations of the various blocks for that file. The application
then communicates with the appropriate DataNodes to access the file.
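A minimal sketch of that interaction, under the same assumptions and hypothetical file path as above: the client obtains block locations through metadata calls answered by the NameNode, and the Hadoop client library then handles the connections to the appropriate DataNodes when the file is read.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Metadata lookup: the NameNode reports which DataNodes hold each block.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }

        // Data access: the client library streams the blocks from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println("First line: " + reader.readLine());
        }

        fs.close();
    }
}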
Each DataNode periodically builds a report about the blocks stored on the
DataNode and sends the report to the NameNode. If one or more blocks are not
accessible on a DataNode, the NameNode ensures that an accessible copy of an
inaccessible data block is replicated to another machine. For performance reasons,
the NameNode keeps its file system metadata in memory. Because the NameNode is critical
to the operation of HDFS, any unavailability or corruption of the NameNode
results in a data unavailability event on the cluster. Thus, the NameNode is viewed
as a single point of failure in the Hadoop environment [15]. To minimize the chance
of a NameNode failure and to improve performance, the NameNode is typically run
on a dedicated machine.
A third daemon, the Secondary NameNode, provides the capability to perform
some of the NameNode tasks to reduce the load on the NameNode. Such tasks
include updating the file system image with the contents of the file system edit logs.
It is important to note that the Secondary NameNode is not a backup or redundant
NameNode. In the event of a NameNode outage, the NameNode must be restarted
and initialized with the last file system image file and the contents of the edit logs.
The latest versions of Hadoop provide an HDFS High Availability (HA) feature.
This feature enables the use of two NameNodes: one in an active state and the other in a standby state.