Pig - Hadoop: The Definitive Guide


An Example

Let's look at a simple example by writing the program to calculate the maximum recorded temperature by year for the weather dataset in Pig Latin (just like we did using MapReduce in Chapter 2). The complete program is only a few lines long:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
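To make the data flow concrete, here is a minimal Python sketch of what the script computes, one step per Pig Latin statement. It is an illustration of the semantics only, not how Pig executes the script. The input rows are the four sample tuples listed later in this section, and the names MISSING and GOOD_QUALITY are invented for the sketch.

```python
from collections import defaultdict

# Illustrative input: the four (year, temperature, quality) tuples that
# appear as sample data later in this section.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
]

MISSING = 9999                   # hypothetical name for the filtered sentinel
GOOD_QUALITY = {0, 1, 4, 5, 9}   # hypothetical name for the accepted codes

# FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
filtered_records = [
    r for r in records if r[1] != MISSING and r[2] in GOOD_QUALITY
]

# GROUP filtered_records BY year
grouped_records = defaultdict(list)
for row in filtered_records:
    grouped_records[row[0]].append(row)

# FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature)
# Sorting is only for a deterministic listing in this sketch.
max_temp = sorted(
    (year, max(row[1] for row in rows))
    for year, rows in grouped_records.items()
)

# DUMP max_temp
for row in max_temp:
    print(row)
```

Each intermediate variable carries the same name as the corresponding relation in the script, which makes it easy to compare the two line by line.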

To explore what's going on, we'll use Pig's Grunt interpreter, which allows us to enter lines and interact with the program to understand what it's doing. Start up Grunt in local mode, and then enter the first line of the Pig script:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>   AS (year:chararray, temperature:int, quality:int);

For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields. (Pig actually has more flexibility than this with regard to the input formats it accepts, as we'll see later.) This line describes the input data we want to process. The year:chararray notation describes the field's name and type; chararray is like a Java String, and an int is like a Java int. The LOAD operator takes a URI argument; here we are just using a local file, but we could refer to an HDFS URI. The AS clause (which is optional) gives the fields names to make it convenient to refer to them in subsequent statements.
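As a rough analogy for what the schema in the AS clause does, the following Python sketch splits tab-delimited lines and converts each field to its declared type. The SCHEMA list, the parse_line helper, and the sample lines are all invented for illustration. The sketch assumes that a value which cannot be converted becomes a null, which it models with None.

```python
# Hypothetical schema mirroring AS (year:chararray, temperature:int, quality:int).
SCHEMA = [("year", str), ("temperature", int), ("quality", int)]


def parse_line(line, schema=SCHEMA, delimiter="\t"):
    """Split one delimited line and convert each field to its declared type."""
    values = line.rstrip("\n").split(delimiter)
    row = []
    for (_name, convert), raw in zip(schema, values):
        try:
            row.append(convert(raw))
        except ValueError:
            # Assumption: an unconvertible value is treated as null (None here).
            row.append(None)
    return tuple(row)


# Illustrative tab-delimited lines, not the contents of the real sample file.
sample_lines = ["1950\t0\t1", "1950\t22\t1", "1950\t-11\t1", "1949\t111\t1"]
records = [parse_line(line) for line in sample_lines]

for record in records:
    print(record)
```

Keeping the field names next to their types in one list is what lets later steps refer to a field by name rather than by position.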

The result of the LOAD operator, and indeed any operator in Pig Latin, is a relation, which is just a set of tuples. A tuple is just like a row of data in a database table, with multiple fields in a particular order. In this example, the LOAD operator produces a set of (year, temperature, quality) tuples that are present in the input file. We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses:

(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
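The written form of a relation can be reproduced with a few lines of Python, treating the relation as a list of tuples. This is only a sketch of the notation; format_tuple is a hypothetical helper, not part of Pig.

```python
def format_tuple(fields):
    """Render one tuple as comma-separated items in parentheses."""
    return "(" + ",".join(str(field) for field in fields) + ")"


# The four tuples listed above, held as an ordinary list of Python tuples.
relation = [("1950", 0, 1), ("1950", 22, 1), ("1950", -11, 1), ("1949", 111, 1)]

# One tuple per line, matching the notation used in the text.
lines = [format_tuple(t) for t in relation]
print("\n".join(lines))
```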
