An Example
Let's look at a simple example by writing a Pig Latin program that calculates the maximum
recorded temperature by year for the weather dataset (just as we did using MapReduce
in Chapter 2). The complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
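For readers who find it easier to trace a data flow in a general-purpose language, the following Python sketch mirrors the filter, group, and maximum steps of the script on four (year, temperature, quality) tuples from the sample data. It is only an illustration of the logic, not of how Pig executes the script; the names MISSING, GOOD_QUALITY, and max_temp_by_year are hypothetical and do not come from Pig.

```python
from collections import defaultdict

# Four (year, temperature, quality) tuples from the sample data.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
]

MISSING = 9999                  # temperature value the script filters out
GOOD_QUALITY = {0, 1, 4, 5, 9}  # quality codes the script accepts


def max_temp_by_year(rows):
    """Mirror FILTER, GROUP, and FOREACH ... MAX from max_temp.pig."""
    # FILTER: drop excluded temperatures and unaccepted quality codes.
    filtered = [r for r in rows if r[1] != MISSING and r[2] in GOOD_QUALITY]

    # GROUP ... BY year: collect the temperatures seen for each year.
    grouped = defaultdict(list)
    for year, temperature, _quality in filtered:
        grouped[year].append(temperature)

    # FOREACH ... GENERATE group, MAX(...): one (year, max) pair per group.
    # Sorting is only for a stable display order in this sketch.
    return sorted((year, max(temps)) for year, temps in grouped.items())


for row in max_temp_by_year(records):
    print(row)
```

Each Pig statement corresponds to one step in the function, which is why the Pig Latin version stays so short: the language supplies filtering, grouping, and aggregation as built-in operators.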
To explore what's going on, we'll use Pig's Grunt interpreter, which allows us to enter lines
and interact with the program to understand what it's doing. Start up Grunt in local mode,
and then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>   AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line
having just year, temperature, and quality fields. (Pig actually has more flexibility than
this with regard to the input formats it accepts, as we'll see later.) This line describes the
input data we want to process. The year:chararray notation describes the field's name and
type; chararray is like a Java String, and an int is like a Java int. The LOAD operator
takes a URI argument; here we are just using a local file, but we could refer to an
HDFS URI. The AS clause (which is optional) gives the fields names to make it convenient
to refer to them in subsequent statements.
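As a rough illustration of what the schema in the AS clause buys us, the Python sketch below splits one tab-delimited line into three named, typed fields. The Record type and parse_line function are hypothetical names invented for this example, and the sketch does no error handling, so it should not be read as a description of how Pig's loader treats malformed input.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One input line, with the field names and types from the AS clause."""
    year: str          # chararray in the Pig schema
    temperature: int
    quality: int


def parse_line(line: str) -> Record:
    """Split one tab-delimited line into the three named, typed fields."""
    year, temperature, quality = line.rstrip("\n").split("\t")
    return Record(year, int(temperature), int(quality))


record = parse_line("1950\t22\t1\n")
# Fields can now be referred to by name rather than by position.
print(record.year, record.temperature, record.quality)
```

Naming the fields up front is what lets later statements say temperature or quality instead of counting columns.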
The result of the LOAD operator, and indeed any operator in Pig Latin, is a relation, which
is just a set of tuples. A tuple is just like a row of data in a database table, with multiple
fields in a particular order. In this example, the LOAD operator produces a set of (year,
temperature, quality) tuples that are present in the input file. We write a relation with one
tuple per line, where tuples are represented as comma-separated items in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)