A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;
Pig Latin has a list of keywords that have a special meaning in the language and cannot be
used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands (cat,
ls), expressions (matches, FLATTEN), and functions (DIFF, MAX), all of which are
covered in the following sections.
Pig Latin has mixed rules on case sensitivity. Operators and commands are not case
sensitive (to make interactive use more forgiving); however, aliases and function names are
case sensitive.
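The case rules above can be sketched as follows. This is a minimal illustration rather than a canonical script: it reuses the input path and schema of the max_temp.pig example in this section, and the alias name grouped is chosen here for illustration.

```pig
-- Keywords are not case sensitive: load/LOAD and as/AS are interchangeable.
records = load 'input/ncdc/micro-tab/sample.txt'
    as (year: chararray, temperature: int, quality: int);

-- Aliases are case sensitive: records and RECORDS are two different names,
-- so writing RECORDS below would refer to an alias that was never defined.
grouped = GROUP records BY year;

-- Function names are case sensitive: the built-in is MAX, not max.
max_temp = foreach grouped generate group, MAX(records.temperature);
dump max_temp;
```

Mixing keyword case like this is legal but hard to read; in practice, picking one convention (commonly uppercase keywords) keeps scripts consistent.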
Statements
As a Pig Latin program is executed, each statement is parsed in turn. If there are syntax
errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and
display an error message. The interpreter builds a logical plan for every relational
operation, which forms the core of a Pig Latin program. The logical plan for the statement is
added to the logical plan for the program so far, and then the interpreter moves on to the
next statement.
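As a small sketch of this statement-by-statement checking, consider the fragment below. The misspelled alias recrods is hypothetical and introduced only for illustration; the input path and schema are those of the max_temp.pig example in this section. The exact error text depends on the Pig version, so none is shown.

```pig
-- Parsed and added to the logical plan; no data is read at this point.
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year: chararray, temperature: int, quality: int);

-- If uncommented, the next statement would be rejected as soon as it is
-- parsed, because the alias recrods (a deliberate typo) was never defined:
-- filtered = FILTER recrods BY temperature != 9999;

-- The corrected statement refers to a defined alias, so its logical plan
-- is appended to the plan built so far.
filtered = FILTER records BY temperature != 9999;
```

Entering these lines one at a time in an interactive session is an easy way to see that the error is reported at the offending statement rather than later, when output is requested.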
It's important to note that no data processing takes place while the logical plan of the
program is being constructed. For example, consider again the Pig Latin program from the
first example:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year: chararray, temperature: int, quality: int);
filtered_records = FILTER records BY temperature != 9999 AND
  quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
When the Pig Latin interpreter sees the first line containing the LOAD statement, it
confirms that it is syntactically and semantically correct and adds it to the logical plan, but it
does not load the data from the file (or even check whether the file exists). Indeed, where
would it load it? Into memory? Even if it did fit into memory, what would it do with the
data? Perhaps not all the input data is needed (because later statements filter it, for
example), so it would be pointless to load it. The point is that it makes no sense to start any
processing until the whole flow is defined. Similarly, Pig validates the GROUP and