-- max_temp_filter_stream.pig
DEFINE is_good_quality `is_good_quality.py`
  SHIP ('ch16-pig/src/main/python/is_good_quality.py');
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = STREAM records THROUGH is_good_quality
  AS (year:chararray, temperature:int);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
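
The script above ships is_good_quality.py to the cluster, but the script itself is not shown in this section. The following is a minimal sketch of what such a streaming filter might look like, not the actual file. It assumes that each input line carries three tab-separated fields (year, temperature, quality), that a temperature of 9999 marks a missing reading, and that quality codes 0, 1, 4, 5, and 9 count as good. The helper names is_good_quality, filter_records, and main are hypothetical.

```python
import sys

# Assumptions for illustration only: the sentinel value and the set of
# acceptable quality codes are not stated in this section.
MISSING_TEMPERATURE = "9999"
GOOD_QUALITY_CODES = {"0", "1", "4", "5", "9"}


def is_good_quality(line):
    """Return 'year<TAB>temperature' for a good record, or None to drop it."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None  # malformed line: skip rather than crash the stream
    year, temperature, quality = fields
    if temperature == MISSING_TEMPERATURE:
        return None
    if quality not in GOOD_QUALITY_CODES:
        return None
    return f"{year}\t{temperature}"


def filter_records(lines):
    """Yield the projected (year, temperature) line for each good record."""
    for line in lines:
        result = is_good_quality(line)
        if result is not None:
            yield result


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write the surviving ones to stdout."""
    for result in filter_records(stdin):
        stdout.write(result + "\n")
```

A script used with STREAM would end with a call to main(), so that it reads tuples from standard input and writes the filtered tuples to standard output. Because the sketch emits only two fields per line, it lines up with the two-field schema in the AS clause that follows STREAM.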
Grouping and Joining Data
Joining datasets in MapReduce takes some work on the part of the programmer (see
Joins), whereas Pig has very good built-in support for join operations, making it much
more approachable. Since the large datasets that are suitable for analysis by Pig (and
MapReduce in general) are not normalized, however, joins are used less frequently in
Pig than they are in SQL.
JOIN
Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each, which is the
first field of A ($0) and the second field of B ($1):
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
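
To make the semantics of this inner join concrete, here is a small Python sketch that reproduces the result for the same A and B. It uses an in-memory hash join purely as an illustration of which tuples match; it is not a description of how Pig executes JOIN. The function name inner_join and its parameters are hypothetical.

```python
from collections import defaultdict


def inner_join(left, left_key, right, right_key):
    """Join two lists of tuples on left[left_key] == right[right_key].

    Illustrative in-memory hash join: index the right-hand tuples by key,
    then probe the index once per left-hand tuple.
    """
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)

    joined = []
    for row in left:
        # .get avoids inserting empty entries for keys with no match
        for match in index.get(row[left_key], []):
            joined.append(row + match)  # all fields of both inputs
    return joined


A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

# Equivalent in spirit to: C = JOIN A BY $0, B BY $1;
C = inner_join(A, 0, B, 1)
for row in sorted(C):
    print(row)
```

The tuple (1,Scarf) from A and the tuple (Ali,0) from B have no partner on the other side, so neither appears in the result. The key 2 occurs once in A and twice in B, so it contributes two output tuples.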