grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}
In this case, we have specified only the names of the fields in the schema: year,
temperature, and quality. The types default to bytearray, the most general type,
representing a binary string.
You don't need to specify types for every field; you can leave some to default to
bytearray, as we have done for year in this declaration:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}
However, if you specify a schema in this way, you do need to specify every field. Also,
there's no way to specify the type of a field without specifying the name. On the other
hand, the schema is entirely optional and can be omitted by not specifying an AS clause:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.
Fields in a relation with no schema can be referenced using only positional notation: $0
refers to the first field in a relation, $1 to the second, and so on. Their types default to
bytearray:
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}
Although it can be convenient not to assign types to fields (particularly in the first stages
of writing a query), doing so can improve the clarity and efficiency of Pig Latin programs
and is generally recommended.
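As a minimal sketch of adding names and types after the fact, a relation loaded without a schema can be projected with casts and AS aliases in a FOREACH...GENERATE statement. The relation name typed_records and the choice of int for the second and third fields are assumptions made for illustration, mirroring the earlier declaration:

```pig
-- records is loaded without an AS clause, so its fields are addressed by position.
records = LOAD 'input/ncdc/micro-tab/sample.txt';

-- typed_records is a hypothetical name; the casts assume that the second
-- and third fields hold integers, as in the typed declaration shown earlier.
typed_records = FOREACH records GENERATE
    $0 AS year,
    (int) $1 AS temperature,
    (int) $2 AS quality;

DESCRIBE typed_records;
```

Later statements can then refer to the fields by name rather than by position.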
Using Hive tables with HCatalog
Declaring a schema as a part of the query is flexible but doesn't lend itself to schema
reuse. A set of Pig queries over the same input data will often have the same schema re-