repeated in each query. If the query processes a large number of fields, this repetition can become hard to maintain.
HCatalog (which is a component of Hive) solves this problem by providing access to Hive's metastore, so that Pig queries can reference schemas by name, rather than specifying them in full each time. For example, after running through An Example to load data into a Hive table called records, Pig can access the table's schema and data as follows:
% pig -useHCatalog
grunt> records = LOAD 'records' USING org.apache.hcatalog.pig.HCatLoader();
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
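HCatalog provides a corresponding storer, so results can be written back to a Hive table without restating the schema either. The following is a minimal sketch, assuming a Hive table called records_copy (a hypothetical name) already exists in the metastore with a compatible (year, temperature, quality) schema:
grunt> STORE records INTO 'records_copy'
>>     USING org.apache.hcatalog.pig.HCatStorer();
As with HCatLoader, the column names and types come from the metastore, so no AS clause is needed.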
Validation and nulls
A SQL database will enforce the constraints in a table's schema at load time; for example, trying to load a string into a column that is declared to be a numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value. Let's see how this works when we have the following input for the weather data, which has an "e" character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved using STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)
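A value that has become null in this way can then be detected or discarded with Pig's is null and is not null operators. The following is a small sketch using the records relation loaded above; the relation names corrupt_records and good_records are just illustrative:
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)
grunt> good_records = FILTER records BY temperature is not null;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)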