fundamental to the operation of Hadoop. Similarly, MapReduce currently
provides both the scheduling and execution engines, as well as the
programming model, for the whole of Hadoop. Without these two projects
there simply is no Hadoop.
In this next section, we are going to delve a little deeper into these core
Hadoop projects to build up our knowledge of the main building blocks.
Once we've done that, we'll be well placed to move forward with the next
section, which will touch on some of the other projects in the Hadoop
ecosystem.
HDFS
HDFS, one of the core components of Apache Hadoop, stands for Hadoop
Distributed File System. There's no exotic branding to be found here. HDFS
is a Java-based, distributed, fault-tolerant file storage system designed for
distribution across a number of commodity servers. These servers have been
configured to operate together as an HDFS cluster. By leveraging a scale-out
model, HDFS ensures that it can support truly massive data volumes at a
low and linear cost point.
Before diving into the details of HDFS, it is worth taking a moment to
discuss the files themselves. Files created in HDFS are made up of a number
of HDFS data blocks, or simply HDFS blocks. These blocks are not small.
They are 64MB or more in size, which allows for larger I/O sizes and in turn
greater throughput. Each block is replicated and then distributed across the
machines of the HDFS cluster.
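To make the block mechanics concrete, here is a minimal sketch using the
Hadoop Java file system API. The path /user/demo/example.dat and the
specific values are illustrative assumptions, not taken from the text; the
create() overload shown lets a client choose the replication factor and
block size on a per-file basis rather than relying on the cluster defaults.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; replication and block size are set
            // per file here: 3 replicas, 64 MB blocks.
            Path file = new Path("/user/demo/example.dat");
            FSDataOutputStream out =
                    fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
            try {
                out.writeUTF("one small record");
            } finally {
                out.close();
            }
        }
    }

Anything written to this stream is carved into 64 MB blocks by HDFS, and
each block is then replicated three times across the cluster's machines.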
HDFS is built on three core subcomponents:
• NameNode
• DataNode
• Secondary NameNode
Simply put, the NameNode is the “brain.” It is responsible for managing
the file system, and therefore is responsible for allocating directories and
files. The NameNode also manages the blocks, which reside on the
DataNodes. There is only one NameNode per HDFS cluster.
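As a small illustration of the NameNode's role as the metadata "brain,"
the sketch below asks HDFS where the blocks of a file live. The query is
answered entirely from the NameNode's metadata; no block data is read, and
the hosts returned are the DataNodes holding each replica. The file path
is again a hypothetical placeholder.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file; substitute any path in your cluster.
            Path file = new Path("/user/demo/example.dat");
            FileStatus status = fs.getFileStatus(file);

            // Answered from NameNode metadata alone.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + Arrays.toString(block.getHosts()));
            }
        }
    }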
The DataNodes are the workers, sometimes known as slaves. The
DataNodes perform the bidding of the NameNode. DataNodes exist on every
machine in the cluster, and they are responsible for offering up the