Management and Monitoring - Field Guide to Hadoop


columns and a very SQL-like feel). HCatalog is closely associated with Hive because it uses

and derives from the Hive metastore, where Hive stores the metadata about its tables.

HCatalog has the notion of partitions. A partition is a subset of rows of a table that have

some common characteristic. Often, tables are partitioned by a date field. This makes the table

easy to query and easy to manage, since partitions can be dropped when they are no longer needed.
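As a rough illustration of why a date partition makes querying easy, the sketch below reads a single day from a partitioned table through HCatalog's Pig loader. The table name movie_reviews_by_day and the partition column review_date are hypothetical, and the script assumes Pig was started with HCatalog on its classpath (for example, with pig -useHCatalog).

```pig
-- Hypothetical table 'movie_reviews_by_day', assumed to be partitioned by review_date
reviews = load 'movie_reviews_by_day'
    using org.apache.hcatalog.pig.HCatLoader();

-- Filtering on the partition column right after the load lets HCatalog
-- read only the matching partition instead of scanning the whole table
one_day = filter reviews by review_date == '2013-06-01';

dump one_day;
```

The partition column appears as an ordinary field in the loaded relation, so the filter reads like any other Pig filter.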

If you decide to use HCatalog, you'll access your data via the HCatalog methods rather than

those native to Pig or MapReduce. For example, in Pig, you commonly use PigStorage or

TextLoader to read data, whereas with HCatalog, you would use HCatLoader to read data and

HCatStorer to write it.

Tutorial Links

HCatalog is one of the more sparsely documented major projects in the Hadoop ecosystem,

but this tutorial from Hortonworks is well done.

Example Code

In Pig without HCatalog, you might load a file using something like:

reviews = load 'reviews.csv' using PigStorage(',')
    as (reviewer:chararray, title:chararray, rating:int);

Using HCatalog, you might first create a table within Hive:

CREATE TABLE movie_reviews
( reviewer STRING, title STRING, rating INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

and then use it in your Pig statement:

reviews = load 'movie_reviews'
    USING org.apache.hcatalog.pig.HCatLoader();
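Writing goes through HCatStorer in much the same way. The following is a minimal sketch rather than a definitive recipe: it assumes a second Hive table, here given the hypothetical name good_movie_reviews, has already been created with the same three columns as movie_reviews, and that Pig was launched with HCatalog available (for example, with pig -useHCatalog).

```pig
-- Read the table defined in Hive; the schema comes from the metastore,
-- so no "as (...)" clause is needed
reviews = load 'movie_reviews'
    using org.apache.hcatalog.pig.HCatLoader();

-- Keep only the favorable reviews (the threshold of 4 is an arbitrary choice)
good = filter reviews by rating >= 4;

-- Write into a pre-existing table; 'good_movie_reviews' is a hypothetical
-- name and must already exist in Hive with matching columns
store good into 'good_movie_reviews'
    using org.apache.hcatalog.pig.HCatStorer();
```

Because the column names and types come from the metastore, the Pig script does not need to know the file location or the field delimiter. Depending on the release you run, these classes may live under org.apache.hive.hcatalog.pig instead, so check the package name for your installation.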
