peated in each query. If the query processes a large number of fields, this repetition can become hard to maintain.

HCatalog (which is a component of Hive) solves this problem by providing access to Hive's metastore, so that Pig queries can reference schemas by name rather than specifying them in full each time. For example, after running through "An Example" to load data into a Hive table called records, Pig can access the table's schema and data as follows:
    % pig -useHCatalog
    grunt> records = LOAD 'records' USING org.apache.hcatalog.pig.HCatLoader();
    grunt> DESCRIBE records;
    records: {year: chararray,temperature: int,quality: int}
    grunt> DUMP records;
    (1950,0,1)
    (1950,22,1)
    (1950,-11,1)
    (1949,111,1)
    (1949,78,1)
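HCatalog's storage side works the same way in reverse. As a brief sketch (not part of the original example): assuming a Hive table with a matching schema already exists, here named records_copy for illustration, the companion HCatStorer class writes a Pig relation into it, again without restating the schema:

    grunt> records = LOAD 'records' USING org.apache.hcatalog.pig.HCatLoader();
    grunt> STORE records INTO 'records_copy' USING org.apache.hcatalog.pig.HCatStorer();

Because the schema lives in the metastore, both the load and the store stay valid if the table definition later gains fields.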
Validation and nulls
A SQL database will enforce the constraints in a table's schema at load time; for example, trying to load a string into a column that is declared to be a numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value. Let's see how this works when we have the following input for the weather data, which has an 'e' character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved using STORE):

    grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
    >>     AS (year:chararray, temperature:int, quality:int);
    grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)