Chapter 4
HDFS, Hive, HBase, and HCatalog
What You Will Learn in This Chapter
• Exploring HDFS
• Working with Hive
• Understanding HBase and HCatalog
One of the key pieces of the Hadoop big data platform is the file system.
Functioning as the backbone, it is used to store and later retrieve your data,
making it available to consumers for a multitude of tasks, including data
processing.
Unlike the file system on your desktop or laptop computer, where drives
are typically measured in gigabytes, the Hadoop Distributed File System
(HDFS) must be capable of storing files that are themselves gigabytes or
even terabytes in size. This presents a series of unique challenges that
must be overcome.
This chapter discusses HDFS, its architecture, and how it overcomes many of
these hurdles, such as storing your big data reliably, providing efficient
access, and replicating data throughout your cluster. We will also look at
Hive, HBase, and HCatalog, platforms and tools within the Hadoop ecosystem
that help simplify the management and subsequent retrieval of data stored
in HDFS.
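To make this concrete before diving deeper, here is a minimal sketch of storing and retrieving a file through the Hadoop Java FileSystem API. The NameNode address (hdfs://namenode:8020) and the file path are hypothetical placeholders chosen for illustration, not values prescribed by this chapter.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; substitute your NameNode host and port.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/greeting.txt"); // hypothetical path

            // Write a small file into HDFS, overwriting any existing copy.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back, much as you would read a local file.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}

Notice that, aside from the Configuration pointing at the cluster, the code reads like ordinary file I/O; HDFS deliberately presents a familiar file system interface over the distributed machinery described next.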
Exploring the Hadoop Distributed File System
Originally created as part of a web search engine project called Apache Nutch,
HDFS is a distributed file system designed to run on a cluster of cost-effective
commodity hardware. Although there are a number of distributed file
systems in the marketplace, several notable characteristics make HDFS
stand out. These characteristics align with the overall goals defined by the
HDFS team and are enumerated here:
Fault tolerance: Instead of assuming that hardware failure is rare,
HDFS assumes that failures are instead the norm. To this end, an HDFS
instance consists of multiple machines or servers that each store part of
the file system's data, so the loss of any single machine does not mean
the loss of the data it held (see the replication sketch that follows).
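As a minimal sketch of how replication underpins this fault tolerance, the following uses the Hadoop Java FileSystem API to inspect and raise the replication factor of a single file. The NameNode address and file path are again hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/greeting.txt"); // hypothetical path

            // Inspect how many copies of each block HDFS currently keeps.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Ask HDFS to keep an extra copy of every block of this file,
            // so it can survive the loss of more machines.
            fs.setReplication(file, (short) 4);
        }
    }
}

Because every block is stored on several machines, a failed server costs the cluster nothing but spare capacity: the NameNode simply schedules new copies of the affected blocks from the surviving replicas.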