Pig - Hadoop: The Definitive Guide


-- max_temp_filter_stream.pig
DEFINE is_good_quality `is_good_quality.py`
  SHIP ('ch16-pig/src/main/python/is_good_quality.py');
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = STREAM records THROUGH is_good_quality
  AS (year:chararray, temperature:int);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
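The streaming script itself is not shown here, so the following is only a sketch of what a filter such as is_good_quality.py might look like, written in Python. It assumes Pig's default streaming serialization, in which each tuple arrives as one tab-separated line. The names good_records, MISSING, and GOOD_QUALITY, the sentinel value 9999, and the set of accepted quality codes are assumptions made for illustration, not details taken from the listing above.

```python
#!/usr/bin/env python3
import re
import sys

# Assumed sentinel for a missing temperature reading (hypothetical value).
MISSING = "9999"
# Assumed set of quality codes that count as "good" (hypothetical).
GOOD_QUALITY = re.compile(r"[01459]")


def good_records(lines):
    """Yield 'year<TAB>temperature' for each input line that passes the filter.

    Each input line is expected to hold three tab-separated fields:
    year, temperature, quality.
    """
    for line in lines:
        year, temp, quality = line.rstrip("\n").split("\t")
        if temp != MISSING and GOOD_QUALITY.fullmatch(quality):
            # The quality field is dropped, matching the two-field AS schema
            # that follows the STREAM operator.
            yield f"{year}\t{temp}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Copy the records that pass the filter from stdin to stdout."""
    for record in good_records(stdin):
        stdout.write(record + "\n")
```

To act as the streaming command, the file would end with a call to main() and would need to be executable. Pig then pipes each tuple of records to the script and parses every line it writes according to the AS clause of the STREAM statement.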

Grouping and Joining Data

Joining datasets in MapReduce takes some work on the part of the programmer (see Joins), whereas Pig has very good built-in support for join operations, making it much more approachable. However, because the large datasets that are suitable for analysis by Pig (and MapReduce in general) are not normalized, joins are used less frequently in Pig than they are in SQL.

JOIN

Let's look at an example of an inner join. Consider the relations A and B:

grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)

We can join the two relations on the numerical (identity) field in each:

grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
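To make the semantics of the inner join concrete, here is a small Python sketch that reproduces the result above using the same data. It illustrates what the join computes, not how Pig executes it. The helper inner_join and its parameter names are hypothetical.

```python
from collections import defaultdict

# The relations A and B from the example, as lists of tuples.
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]


def inner_join(left, left_key, right, right_key):
    """Return left + right for every pair of tuples whose key fields are equal."""
    # Index the right-hand relation by its join key.
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Emit one combined tuple per match; unmatched tuples produce nothing.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Equivalent in spirit to: C = JOIN A BY $0, B BY $1;
C = sorted(inner_join(A, 0, B, 1))
for row in C:
    print(row)
```

The tuples (1,Scarf) and (Ali,0) have no partner in the other relation, so they do not appear in C. The key 2 matches two tuples in B, so (2,Tie) appears twice. The call to sorted is there only to make the order deterministic in this sketch.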
