-- max_temp_filter_stream.pig
DEFINE is_good_quality `is_good_quality.py`
  SHIP ('ch16-pig/src/main/python/is_good_quality.py');
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = STREAM records THROUGH is_good_quality
  AS (year:chararray, temperature:int);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
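The streaming script itself is not shown in this section. As a rough sketch of what a script like is_good_quality.py could look like, assuming it reads tab-separated (year, temperature, quality) lines on standard input and writes (year, temperature) for the readings it keeps: the missing-value sentinel 9999, the set of accepted quality codes, and the helper names is_good and filter_lines are all assumptions made for illustration, not taken from the text above.

```python
#!/usr/bin/env python3
import sys

# Assumptions for illustration only: the sentinel for a missing reading and
# the quality codes treated as "good" are not given in this section.
MISSING = "9999"
GOOD_QUALITY = {"0", "1", "4", "5", "9"}


def is_good(temperature, quality):
    """Return True if the reading is present and its quality code is accepted."""
    return temperature != MISSING and quality in GOOD_QUALITY


def filter_lines(lines):
    """Yield 'year<TAB>temperature' for each good-quality input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines rather than failing the stream
        year, temperature, quality = fields
        if is_good(temperature, quality):
            yield f"{year}\t{temperature}"


if __name__ == "__main__":
    # Pig pipes each tuple of the streamed relation to stdin, one per line,
    # and reads the script's stdout back as the tuples of the new relation.
    for out in filter_lines(sys.stdin):
        print(out)
```

Because the script emits only two fields, the AS clause on the STREAM statement declares a two-field schema (year, temperature) rather than the three-field schema of the input.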
Grouping and Joining Data
Joining datasets in MapReduce takes some work on the part of the programmer,
whereas Pig's built-in support for join operations makes joins much
more approachable. Since the large datasets that are suitable for analysis by Pig (and
MapReduce in general) are not normalized, however, joins are used less frequently in
Pig than they are in SQL.
JOIN
Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
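To make the matching rule concrete, here is a small Python sketch of inner-join semantics applied to the same two relations. It is only an illustration of which tuples pair up, not a description of how Pig executes a join; the function name inner_join and the in-memory index are assumptions made for the example.

```python
from collections import defaultdict

# The relations A and B from the example above, as lists of tuples.
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]


def inner_join(left, left_key, right, right_key):
    """Pair every left tuple with every right tuple that has an equal key.

    left_key and right_key are field positions, like $0 and $1 in Pig.
    """
    # Index the right relation by its join field.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)

    joined = []
    for row in left:
        # A left tuple with no match contributes nothing (inner join).
        for match in index.get(row[left_key], []):
            joined.append(row + match)  # all fields from both inputs
    return joined


C = inner_join(A, 0, B, 1)
for row in sorted(C):
    print(row)
```

The sketch yields the same four tuples as DUMP C. Key 2 appears twice in B, so the Tie tuple is emitted once per match, while Scarf (key 1) and Ali (key 0) have no partner and are dropped.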