Our iislog table can now be referenced directly one or more times in Hive
by simply using the table name, as seen previously. Because HCatalog is
integrated across platforms, the same table can also be referenced in a Pig
job.
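For instance, a HiveQL query against the table might look like the following. This is a minimal sketch; the column names (ip, uristem) are assumed from the IIS log schema defined later in this section:

```sql
-- Hypothetical query: count requests per client IP in the iislog table.
SELECT ip, COUNT(*) AS requests
FROM iislog
GROUP BY ip;
```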
Let's look first at an example of a simple Pig Latin script that references the
data location directly and includes the column schema definition:
A = load '/data/logs' using PigStorage() as
    (date:chararray, time:chararray, username:chararray,
     ip:chararray, port:int, method:chararray,
     uristem:chararray, uriquery:chararray, timetaken:int,
     useragent:chararray, referrer:chararray);
You can compare and contrast the code samples to see how HCatalog
simplifies the process by removing both the data storage location and
schema from the script:
A = load 'iislog' using HCatLoader();
In this example, if the underlying data structure were to change and the
location of the logs were moved from the /data/logs path to
/archive/2013/weblogs, the HCatalog metadata could be updated using the
ALTER statement. This allows all the Hive scripts, MapReduce jobs, and Pig
jobs that use HCatalog to continue to run without modification:
ALTER TABLE iislog
SET LOCATION '/archive/2013/weblogs';
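After such a change, the table's new storage path can be confirmed from the Hive shell. This is a sketch using the standard DESCRIBE FORMATTED statement:

```sql
-- Inspect the table's metadata; the Location: field in the output
-- should now show the updated path.
DESCRIBE FORMATTED iislog;
```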
Together, these features allow Hive to look and act like a database or data
warehouse over your big data. In the next section, we will explore a different
implementation that provides a NoSQL database on top of HDFS.
Exploring HBase: An HDFS Column-oriented
Database
So far, all the techniques presented in this chapter have a similar use case.
They are all efficient at simplifying access to your big data, but they have
all largely been focused on batch-centric operations and are mostly