Database Reference
columns and a very SQL-like feel). HCatalog is closely associated with Hive because it uses,
and is derived from, the Hive metastore, the place where Hive stores the metadata about its tables.
HCatalog has the notion of partitions. A partition is a subset of the rows of a table that share
some common characteristic. Often, tables are partitioned by a date field. This makes the table easy
to query and also easy to manage, because whole partitions can be dropped when they are no longer needed.
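To make the idea of date partitions concrete, here is a minimal Python sketch. It is not HCatalog code: the PartitionedTable class and its method names are hypothetical, and the table lives in memory. It only illustrates why a date-partitioned table is cheap to query for a single day and cheap to prune.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy in-memory table whose rows are grouped by a date partition key."""

    def __init__(self):
        # partition key (a date) -> list of rows stored in that partition
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def query(self, day):
        # Only the matching partition is read; other partitions are never scanned.
        return list(self.partitions.get(day, []))

    def drop_partitions_before(self, cutoff):
        # Old data is removed a whole partition at a time, not row by row.
        stale = [day for day in self.partitions if day < cutoff]
        for day in stale:
            del self.partitions[day]
        return sorted(stale)


table = PartitionedTable()
table.insert(date(2013, 1, 1), ("alice", "Casablanca", 5))
table.insert(date(2013, 1, 1), ("bob", "Vertigo", 4))
table.insert(date(2013, 1, 2), ("carol", "Casablanca", 3))

jan1_rows = table.query(date(2013, 1, 1))
dropped = table.drop_partitions_before(date(2013, 1, 2))
remaining = sorted(table.partitions)
```

In a real deployment the partitions would typically be separate storage locations tracked by the metastore, but the access pattern is the same: select by partition key, drop by partition key.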
If you decide to use HCatalog, you'll access your data via the HCatalog methods rather than
those native to Pig or MapReduce. For example, in Pig, you commonly use PigStorage or
TextLoader to read data, whereas when using HCatalog, you would use HCatLoader and
HCatStorer.
Tutorial Links
HCatalog is one of the more sparsely documented major projects in the Hadoop ecosystem,
but this tutorial from Hortonworks is well done.
Example Code
In Pig without HCatalog, you might load a file using something like:
reviews = load 'reviews.csv' using PigStorage(',')
    as (reviewer:chararray, title:chararray, rating:int);
Using HCatalog, you might first create a table within Hive:
CREATE TABLE movie_reviews
( reviewer STRING, title STRING, rating INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
and then use it in your Pig statement:
reviews = load 'movie_reviews'
    using org.apache.hcatalog.pig.HCatLoader();
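Notice that the HCatLoader statement names only the table, with no delimiter and no "as (...)" schema clause, because that information comes from the metastore. The following Python sketch illustrates the difference between the two styles of loading. It is a conceptual model only: CATALOG, load_with_schema, and load_from_catalog are hypothetical names standing in for the metastore and the two loaders, and the sample rows are made up.

```python
import io

# Hypothetical stand-in for a metastore: table name -> storage details and schema.
CATALOG = {
    "movie_reviews": {
        "delimiter": "|",
        "columns": [("reviewer", str), ("title", str), ("rating", int)],
    }
}


def load_with_schema(lines, delimiter, columns):
    """Reader-declared schema: the caller must know the delimiter and column types."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        rows.append({name: cast(value) for (name, cast), value in zip(columns, fields)})
    return rows


def load_from_catalog(table_name, lines):
    """Catalog-declared schema: the caller only names the table."""
    meta = CATALOG[table_name]
    return load_with_schema(lines, meta["delimiter"], meta["columns"])


# Made-up sample data in the pipe-delimited layout declared for movie_reviews.
data = io.StringIO("alice|Casablanca|5\nbob|Vertigo|4\n")
reviews = load_from_catalog("movie_reviews", data)
```

The design point is that the schema is declared once, where the table is created, so every script that reads the table stays in agreement about its layout.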