field is a constant field (every row in B has the same third field) with the chararray value Constant.
The FOREACH...GENERATE operator has a nested form to support more complex processing. In the following example, we compute various statistics for the weather dataset:

-- year_stats.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/all/19{1,2,3,4,5}0*'
    USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,16-19,88-92,93-93')
    AS (usaf:chararray, wban:chararray, year:int, temperature:int,
        quality:int);
grouped_records = GROUP records BY year PARALLEL 30;
year_stats = FOREACH grouped_records {
    uniq_stations = DISTINCT records.usaf;
    good_records = FILTER records BY isGood(quality);
    GENERATE FLATTEN(group),
        COUNT(uniq_stations) AS station_count,
        COUNT(good_records) AS good_record_count,
        COUNT(records) AS record_count;
}
DUMP year_stats;

Using the cut UDF we developed earlier, we load various fields from the input dataset into the records relation. Next, we group records by year. Notice the PARALLEL keyword for setting the number of reducers to use; this is vital when running on a cluster.
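As an aside, if you would rather not repeat a reducer count on every statement, Pig also supports a script-wide default via the set default_parallel option (a brief sketch; the value 10 here is just an illustration, not taken from the example above):

```pig
-- Set a default number of reducers for all subsequent
-- reduce-phase operators (GROUP, JOIN, ORDER BY, and so on),
-- instead of writing PARALLEL on each statement.
SET default_parallel 10;

-- This GROUP now runs with 10 reducers, even without PARALLEL.
grouped_records = GROUP records BY year;
```

A PARALLEL clause on an individual statement still takes precedence over the script-wide default.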
Then we process each group using a nested FOREACH...GENERATE operator. The first nested statement creates a relation for the distinct USAF identifiers for stations using the DISTINCT operator. The second nested statement creates a relation for the records with "good" readings using the FILTER operator and a UDF. The final nested statement is a GENERATE statement (a nested FOREACH...GENERATE must always have a GENERATE statement as the last nested statement) that generates the summary fields of interest using the grouped records, as well as the relations created in the nested block.
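When developing a script like this, it can help to confirm the shape of the result before launching the full job; Pig's DESCRIBE operator prints a relation's schema. A sketch of what we would expect here, given that COUNT yields a long and FLATTEN(group) exposes the group key:

```pig
DESCRIBE year_stats;
-- Expected along the lines of:
-- year_stats: {group: int, station_count: long,
--              good_record_count: long, record_count: long}
```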
Running it on a few years' worth of data, we get the following:
(1920,8L,8595L,8595L)
(1950,1988L,8635452L,8641353L)
(1930,121L,89245L,89262L)
(1910,7L,7650L,7650L)
(1940,732L,1052333L,1052976L)