Pig - Hadoop: The Definitive Guide

there are a few restrictions that can trip up the uninitiated. For example, it's not possible to create a relation from a bag literal. So, the following statement fails:

A = {(1,2),(3,4)}; -- Error

The simplest workaround in this case is to load the data from a file using the LOAD statement.
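As a minimal sketch of that workaround, assume a hypothetical tab-delimited file input/pairs.txt (not part of the original example) that holds the two tuples, one per line. The field names f0 and f1 are also illustrative:

```pig
-- Hypothetical file input/pairs.txt, tab-separated (the default delimiter for PigStorage):
--   1	2
--   3	4
A = LOAD 'input/pairs.txt' AS (f0:int, f1:int); -- f0 and f1 are assumed names
DUMP A;
```

With those file contents, DUMP A should print the tuples (1,2) and (3,4), which is the relation the bag literal was meant to create.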

As another example, you can't treat a relation like a bag and project a field into a new relation ($0 refers to the first field of A, using the positional notation):

B = A.$0;

Instead, you have to use a relational operator to turn the relation A into relation B:

B = FOREACH A GENERATE $0;

It's possible that a future version of Pig Latin will remove these inconsistencies and treat relations and bags in the same way.

Schemas

A relation in Pig may have an associated schema, which gives the fields in the relation names and types. We've seen how an AS clause in a LOAD statement is used to attach a schema to a relation:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>   AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}

This time we've declared the year to be an integer rather than a chararray, even though the file it is being loaded from is the same. An integer may be more appropriate if we need to manipulate the year arithmetically (to turn it into a timestamp, for example), whereas the chararray representation might be more appropriate when it's being used as a simple identifier. Pig's flexibility in the degree to which schemas are declared contrasts with schemas in traditional SQL databases, which are declared before the data is loaded into the system. Pig is designed for analyzing plain input files with no associated type information, so it is quite natural to choose types for fields later than you would with an RDBMS.
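To illustrate choosing types at load time, the following sketch loads the same sample file under two different schemas. The relation names (by_number, next_years, by_id, grouped) and the field name next_year are illustrative and do not come from the original text:

```pig
-- Year declared as an int, so it can take part in arithmetic
by_number = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:int, temperature:int, quality:int);
next_years = FOREACH by_number GENERATE year + 1 AS next_year;

-- Year declared as a chararray, used only as an identifier to group on
by_id = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
grouped = GROUP by_id BY year;
```

Neither LOAD statement changes the file itself. The schema only determines how Pig interprets each field, so the choice can be made per script.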

It's possible to omit type declarations completely, too:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>   AS (year, temperature, quality);
