Chapter 4
HDFS, Hive, HBase, and HCatalog
What You Will Learn in This Chapter
• Exploring HDFS
• Working with Hive
• Understanding HBase and HCatalog
One of the key pieces of the Hadoop big data platform is the file system.
Functioning as the backbone, it is used to store and later retrieve your data,
making it available to consumers for a multitude of tasks, including data
processing.
Unlike the file system on your desktop or laptop computer, where drives
are typically measured in gigabytes, the Hadoop Distributed File System
(HDFS) must be capable of storing files that are themselves gigabytes or
even terabytes in size. This presents a series of unique challenges that
must be overcome.
This chapter discusses HDFS, its architecture, and how it overcomes many of
these hurdles, such as storing your big data reliably, providing efficient
access, and replicating data throughout your cluster. We will also look at
Hive, HBase, and HCatalog, platforms and tools within the Hadoop ecosystem
that help simplify the management and subsequent retrieval of data stored
in HDFS.
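To make this concrete before diving deeper, here is a minimal sketch of storing and retrieving a file through the Hadoop Java FileSystem API. The NameNode address (hdfs://namenode:8020) and the file path are hypothetical placeholders chosen for illustration, not values prescribed by this chapter.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; substitute your NameNode host and port.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/greeting.txt"); // hypothetical path

            // Write a small file into HDFS, overwriting any existing copy.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back, much as you would read a local file.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}

Notice that, aside from the Configuration pointing at the cluster, the code reads like ordinary file I/O; HDFS deliberately presents a familiar file system interface over the distributed machinery described next.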
Exploring the Hadoop Distributed File System
Originally created as part of a web search engine project called Apache Nutch,
HDFS is a distributed file system designed to run on a cluster of cost-effective
commodity hardware. Although there are a number of distributed file
systems in the marketplace, several notable characteristics make HDFS
stand out. These characteristics align with the overall goals defined by the
HDFS team and are enumerated here:
Fault tolerance: Instead of assuming that hardware failure is rare,
HDFS assumes that failures are instead the norm. To this end, an HDFS
instance consists of multiple machines or servers that each store part of
the file system's data, so the loss of any single machine does not mean
the loss of the data it held (see the replication sketch that follows).
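As a minimal sketch of how replication underpins this fault tolerance, the following uses the Hadoop Java FileSystem API to inspect and raise the replication factor of a single file. The NameNode address and file path are again hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/greeting.txt"); // hypothetical path

            // Inspect how many copies of each block HDFS currently keeps.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Ask HDFS to keep an extra copy of every block of this file,
            // so it can survive the loss of more machines.
            fs.setReplication(file, (short) 4);
        }
    }
}

Because every block is stored on several machines, a failed server costs the cluster nothing but spare capacity: the NameNode simply schedules new copies of the affected blocks from the surviving replicas.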