Pig - Hadoop: The Definitive Guide


grunt> DESCRIBE records;

records: {year: bytearray,temperature: bytearray,quality: bytearray}

In this case, we have specified only the names of the fields in the schema: year, temperature, and quality. The types default to bytearray, the most general type, representing a binary string.

You don't need to specify types for every field; you can leave some to default to bytearray, as we have done for year in this declaration:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>   AS (year, temperature:int, quality:int);

grunt> DESCRIBE records;

records: {year: bytearray,temperature: int,quality: int}

However, if you specify a schema in this way, you do need to specify every field. Also, there's no way to specify the type of a field without specifying the name. On the other hand, the schema is entirely optional and can be omitted by not specifying an AS clause:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';

grunt> DESCRIBE records;

Schema for records unknown.

Fields in a relation with no schema can be referenced using only positional notation: $0 refers to the first field in a relation, $1 to the second, and so on. Their types default to bytearray:

grunt> projected_records = FOREACH records GENERATE $0, $1, $2;

grunt> DUMP projected_records;

(1950,0,1)

(1950,22,1)

(1950,-11,1)

(1949,111,1)

(1949,78,1)

grunt> DESCRIBE projected_records;

projected_records: {bytearray,bytearray,bytearray}
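Positional references can also be combined with casts and aliases, so a relation loaded without a schema can be given field names and types afterward. The following is a minimal sketch under that assumption; the alias typed_records is hypothetical and does not appear in the original text:

```pig
-- Load without an AS clause, so the relation has no schema.
records = LOAD 'input/ncdc/micro-tab/sample.txt';

-- Refer to each field by position, cast the bytearray values where a
-- numeric type is wanted, and attach a name with AS.
-- typed_records is an illustrative alias, not one used in the text.
typed_records = FOREACH records GENERATE
    $0 AS year,
    (int) $1 AS temperature,
    (int) $2 AS quality;

-- Inspect the schema that the projection produced.
DESCRIBE typed_records;
```

Here year is left uncast, so it should keep the default bytearray type, while the other two fields take the type named in the cast.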

Although it can be convenient not to assign types to fields (particularly in the first stages of writing a query), doing so can improve the clarity and efficiency of Pig Latin programs and is generally recommended.
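As an illustration of the clarity point, the sketch below declares every type up front so that a later aggregation reads as an integer operation without any casts. It assumes the same sample file; the chararray type for year and the aliases grouped_records and max_temp are choices made for this example only:

```pig
-- Declare a name and a type for every field at load time.
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

-- Group the rows by year; each group holds a bag of the original tuples.
grouped_records = GROUP records BY year;

-- Because temperature is declared as int, MAX compares integers directly.
max_temp = FOREACH grouped_records GENERATE
    group,
    MAX(records.temperature);

DUMP max_temp;
```

With an untyped load, the same script would need an explicit cast on temperature before aggregating for the intent to be as clear.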

Using Hive tables with HCatalog

Declaring a schema as a part of the query is flexible but doesn't lend itself to schema reuse. A set of Pig queries over the same input data will often have the same schema re-
