of thousands of HDFS clients per cluster since each DataNode may
execute multiple application tasks simultaneously. The DataNodes
are responsible for serving read and write requests from the file
system's clients, maintaining blocks, and performing replication as
directed by the NameNode. Block management in HDFS differs from
that of a conventional file system: the size of the data file equals
the actual length of the block. If a block is half full, it needs only
half of the space of a full block on the local drive. No extra space is
consumed to round the file up to the nominal block size, as happens
in a regular file system.
3. Image: An image represents the metadata of the namespace (inodes
and lists of blocks). On startup, the NameNode loads the entire
namespace image and keeps it in memory. Holding the image in
memory enables the NameNode to service multiple client requests
concurrently.
4. Journal: The journal is the modification log of the image, kept in
the local host's native file system. During normal operations, each
client transaction is recorded in the journal, and the journal file is
flushed and synced before the acknowledgment is sent to the client.
On startup or during recovery, the NameNode can replay this
journal.
5. Checkpoint: To enable recovery, a persistent record of the image is
also stored in the local host's native file system and is called a
checkpoint. Once the system has started, the NameNode never modifies
or updates the checkpoint file in place. A new checkpoint file can be
created during the next startup, on a restart, or on demand when
requested by the administrator or by the CheckpointNode.
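
The interplay of image, journal, and checkpoint described in items 3 to 5 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HDFS code: the class name TinyNameNode, the file names, the JSON record format, and the choice to truncate the journal after a checkpoint are all hypothetical and chosen only for illustration. HDFS uses its own on-disk formats.

```python
import json
import os
import tempfile


class TinyNameNode:
    """Toy model of image + journal + checkpoint (hypothetical, not HDFS code)."""

    def __init__(self, directory):
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.journal_path = os.path.join(directory, "journal.log")
        self.image = {}  # in-memory namespace: file path -> list of block IDs
        self._load()
        self._journal = open(self.journal_path, "a", encoding="utf-8")

    def _load(self):
        # Startup: read the checkpoint, then replay the journal on top of it.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path, encoding="utf-8") as f:
                self.image = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path, encoding="utf-8") as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, txn):
        # Apply one logged transaction to the in-memory image.
        if txn["op"] == "create":
            self.image[txn["path"]] = txn["blocks"]
        elif txn["op"] == "delete":
            self.image.pop(txn["path"], None)

    def commit(self, txn):
        # Record the transaction, flush and sync the journal, and only
        # then update the image and acknowledge the client.
        self._journal.write(json.dumps(txn) + "\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())
        self._apply(txn)
        return "ack"

    def checkpoint(self):
        # Write a NEW checkpoint file instead of editing the old one,
        # swap it in atomically, then start an empty journal (assumption).
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.image, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.checkpoint_path)
        self._journal.close()
        self._journal = open(self.journal_path, "w", encoding="utf-8")

    def close(self):
        self._journal.close()


def demo():
    # Commit, checkpoint, commit again, shut down, then recover from disk.
    with tempfile.TemporaryDirectory() as d:
        nn = TinyNameNode(d)
        nn.commit({"op": "create", "path": "/a", "blocks": [1, 2]})
        nn.checkpoint()
        nn.commit({"op": "create", "path": "/b", "blocks": [3]})
        nn.commit({"op": "delete", "path": "/a"})
        nn.close()
        recovered = TinyNameNode(d)
        result = recovered.image
        recovered.close()
        return result
```

Calling demo() rebuilds the namespace from the checkpoint plus the two journaled transactions that followed it, so the recovered image contains only "/b". The sync-before-acknowledge order in commit() is the point of the sketch: a transaction the client has seen acknowledged is already on disk and survives a restart.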
17.3.2 HBase
HBase is an open-source, nonrelational, column-oriented, multidimensional,
distributed database modeled on Google's Bigtable architecture.
It is designed with high availability and high performance as drivers to
support storage and processing of large data sets on the Hadoop framework.
HBase is not a database in the purist sense. It offers high scalability
and performance and supports certain features of an ACID-compliant
database. HBase is classified as a NoSQL database because its
architecture and design are closely aligned with BASE (Basically
Available, Soft state, Eventual consistency). Why do we need HBase when
the data are stored in HDFS, which is the core data storage layer within
Hadoop? HBase is useful for operations other than MapReduce execution,
for operations that are awkward to perform directly on HDFS, and for
cases where you need random access to data. First, it provides a
database-style interface to Hadoop, which enables developers to deploy