peated in each query. If the query processes a large number of fields, this repetition can become hard to maintain.

HCatalog (which is a component of Hive) solves this problem by providing access to Hive's metastore, so that Pig queries can reference schemas by name rather than specifying them in full each time. For example, after running through "An Example" to load data into a Hive table called records, Pig can access the table's schema and data as follows:
    % pig -useHCatalog
    grunt> records = LOAD 'records' USING org.apache.hcatalog.pig.HCatLoader();
    grunt> DESCRIBE records;
    records: {year: chararray,temperature: int,quality: int}
    grunt> DUMP records;
    (1950,0,1)
    (1950,22,1)
    (1950,-11,1)
    (1949,111,1)
    (1949,78,1)
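HCatalog's storage side works the same way in reverse. As a brief sketch (not part of the original example): assuming a Hive table with a matching schema already exists, here named records_copy for illustration, the companion HCatStorer class writes a Pig relation into it, again without restating the schema:

    grunt> records = LOAD 'records' USING org.apache.hcatalog.pig.HCatLoader();
    grunt> STORE records INTO 'records_copy' USING org.apache.hcatalog.pig.HCatStorer();

Because the schema lives in the metastore, both the load and the store stay valid if the table definition later gains fields.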
Validation and nulls
A SQL database will enforce the constraints in a table's schema at load time; for example, trying to load a string into a column that is declared to be a numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value. Let's see how this works when we have the following input for the weather data, which has an 'e' character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved using STORE):

    grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
    >>     AS (year:chararray, temperature:int, quality:int);
    grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)