field is a constant field (every row in B has the same third field) with the chararray
value Constant.
The FOREACH...GENERATE operator has a nested form to support more complex pro-
cessing. In the following example, we compute various statistics for the weather dataset:
-- year_stats.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/all/19{1,2,3,4,5}0*'
    USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,16-19,88-92,93-93')
    AS (usaf:chararray, wban:chararray, year:int, temperature:int,
        quality:int);
grouped_records = GROUP records BY year PARALLEL 30;
year_stats = FOREACH grouped_records {
    uniq_stations = DISTINCT records.usaf;
    good_records = FILTER records BY isGood(quality);
    GENERATE FLATTEN(group), COUNT(uniq_stations) AS station_count,
        COUNT(good_records) AS good_record_count,
        COUNT(records) AS record_count;
}
DUMP year_stats;
Using the cut UDF (CutLoadFunc) we developed earlier, we load various fields from the
input dataset into the records relation. Next, we group records by year. Notice the
PARALLEL keyword, which sets the number of reducers to use; setting this is vital when
running on a cluster.
Then we process each group using a nested FOREACH...GENERATE operator. The first
nested statement creates a relation for the distinct USAF identifiers for stations using the
DISTINCT operator. The second nested statement creates a relation for the records with
“good” readings using the FILTER operator and a UDF. The final nested statement is a
GENERATE statement (a nested FOREACH...GENERATE must always have a
GENERATE statement as the last nested statement) that generates the summary fields of
interest using the grouped records, as well as the relations created in the nested block.
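To make the nested block's semantics concrete, here is the same computation sketched in plain Python on a handful of made-up records. The set of quality codes accepted as "good" is an assumption standing in for the IsGoodQuality UDF, not its actual implementation:

```python
# Sketch of the nested FOREACH...GENERATE: for each year, count distinct
# stations, "good" readings, and total readings.
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}  # assumption: codes treated as "good"

def year_stats(records):
    # GROUP records BY year
    groups = defaultdict(list)
    for rec in records:
        groups[rec["year"]].append(rec)
    stats = []
    for year, recs in sorted(groups.items()):
        # DISTINCT records.usaf
        uniq_stations = {r["usaf"] for r in recs}
        # FILTER records BY isGood(quality)
        good = [r for r in recs if r["quality"] in GOOD_QUALITY]
        # GENERATE group, COUNT(...), COUNT(...), COUNT(...)
        stats.append((year, len(uniq_stations), len(good), len(recs)))
    return stats

# Made-up sample data for illustration
sample = [
    {"usaf": "010010", "year": 1950, "quality": 1},
    {"usaf": "010010", "year": 1950, "quality": 2},
    {"usaf": "010020", "year": 1950, "quality": 4},
    {"usaf": "010010", "year": 1949, "quality": 9},
]

print(year_stats(sample))  # one (year, stations, good, total) tuple per group
```

Each tuple corresponds to one row of the relation that DUMP prints below.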
Running it on a few years' worth of data, we get the following:
(1920,8L,8595L,8595L)
(1950,1988L,8635452L,8641353L)
(1930,121L,89245L,89262L)
(1910,7L,7650L,7650L)
(1940,732L,1052333L,1052976L)