Our iislog table can now be referenced directly one or more times in Hive
by simply using the table name, as seen previously. Because HCatalog is
integrated across platforms, the same table can also be referenced in a Pig
job.
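For instance, a HiveQL query against the table might look like the following. This is a minimal sketch; the column names (ip, uristem) are assumed from the IIS log schema defined later in this section:

```sql
-- Hypothetical query: count requests per client IP in the iislog table.
SELECT ip, COUNT(*) AS requests
FROM iislog
GROUP BY ip;
```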
Let's look first at an example of a simple Pig Latin script that references the
data location directly and includes the column schema definition:
A = load '/data/logs' using PigStorage() as
    (date:chararray, time:chararray, username:chararray,
     ip:chararray, port:int, method:chararray,
     uristem:chararray, uriquery:chararray, timetaken:int,
     useragent:chararray, referrer:chararray);
You can compare and contrast the code samples to see how HCatalog
simplifies the process by removing both the data storage location and
schema from the script:
A = load 'iislog' using HCatLoader();
In this example, if the underlying data structure were to change and the
location of the logs were moved from the /data/logs path to
/archive/2013/weblogs, the HCatalog metadata could be updated using the
ALTER statement. This allows all the Hive scripts, MapReduce jobs, and Pig
jobs that use HCatalog to continue to run without modification:
ALTER TABLE iislog
SET LOCATION '/archive/2013/weblogs';
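After such a change, the table's new storage path can be confirmed from the Hive shell. This is a sketch using the standard DESCRIBE FORMATTED statement:

```sql
-- Inspect the table's metadata; the Location: field in the output
-- should now show the updated path.
DESCRIBE FORMATTED iislog;
```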
Together, these features allow Hive to look and act like a database or data
warehouse over your big data. In the next section, we will explore a different
implementation that provides a NoSQL database on top of HDFS.
Exploring HBase: An HDFS Column-oriented
Database
So far, all the techniques presented in this chapter have a similar use case.
They are all efficient at simplifying access to your big data, but they have
all largely been focused on batch-centric operations and are mostly