Table 4.2 WebHDFS Access Commands

File System Command   WebHDFS Equivalent
mkdir                 PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS"
rm                    DELETE "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=DELETE"
ls                    GET "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
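Because every WebHDFS request follows the same URL pattern, the commands in Table 4.2 can be sketched in a few lines of code. The helper below is a minimal illustration, not part of the WebHDFS API itself; the host name and port are assumptions (newer Hadoop NameNodes commonly listen on HTTP port 9870, older releases on 50070):

```python
def webhdfs_url(host, port, path, op, user=None):
    """Build the WebHDFS v1 REST URL for an operation on an HDFS path.

    `path` must begin with '/', and `op` is one of the WebHDFS
    operation names such as MKDIRS, DELETE, or LISTSTATUS.
    """
    url = f"http://{host}:{port}/webhdfs/v1{path}?op={op}"
    if user:
        # Optional: identify the caller on clusters without Kerberos.
        url += f"&user.name={user}"
    return url

# mkdir -> HTTP PUT
print(webhdfs_url("namenode.example.com", 9870, "/tmp/demo", "MKDIRS"))
# rm    -> HTTP DELETE
print(webhdfs_url("namenode.example.com", 9870, "/tmp/demo", "DELETE"))
# ls    -> HTTP GET
print(webhdfs_url("namenode.example.com", 9870, "/tmp", "LISTSTATUS"))
```

The URL returned for each operation would then be issued with the HTTP verb shown in Table 4.2, for example with curl or any HTTP client library.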
Now that you are familiar with the basic concepts behind HDFS, let's look at
some of the other functionality that is built on top of HDFS.
Exploring Hive: The Hadoop Data Warehouse Platform
Within the Hadoop ecosystem, HDFS can load and store massive quantities
of data in an efficient and reliable manner. It can also serve that same data
back up to client applications, such as MapReduce jobs, for processing and
data analysis.
Although this is a productive and workable paradigm for someone with a
developer's background, it does little for an analyst or data scientist
trying to sort through potentially large sets of data, as was the case at
Facebook.
Hive, often considered the Hadoop data warehouse platform, got its start at
Facebook as its analysts struggled to deal with the massive quantities of
data produced by the social network. Requiring analysts to learn and write
MapReduce jobs was neither productive nor practical.
Instead, Facebook developed a data warehouse-like layer of abstraction
based on tables. The tables function merely as metadata; the table schema
is projected onto the data rather than actually moving potentially massive
sets of data. This new capability allowed analysts to use a SQL-like
language called Hive Query Language (HQL) to query massive data sets stored
in HDFS and to perform both simple and sophisticated summarizations and
data analysis.
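As an illustrative sketch of this idea (the table and column names here are invented for the example, not taken from the text), an analyst might declare a schema over files already sitting in HDFS and then summarize them with HQL, without writing any MapReduce code:

```sql
-- Hypothetical external table: the schema is pure metadata projected
-- onto existing HDFS files; no data is moved or converted.
CREATE EXTERNAL TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- A simple summarization: views per page, busiest pages first.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```

Under the hood, Hive translates such a query into one or more jobs that run against the data in HDFS, which is precisely what spared Facebook's analysts from writing those jobs by hand.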