field is a constant field (every row in B has the same third field) with the chararray value Constant.
The FOREACH...GENERATE operator has a nested form to support more complex processing. In the following example, we compute various statistics for the weather dataset:

-- year_stats.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/all/19{1,2,3,4,5}0*'
    USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,16-19,88-92,93-93')
    AS (usaf:chararray, wban:chararray, year:int, temperature:int,
        quality:int);
grouped_records = GROUP records BY year PARALLEL 30;
year_stats = FOREACH grouped_records {
    uniq_stations = DISTINCT records.usaf;
    good_records = FILTER records BY isGood(quality);
    GENERATE FLATTEN(group),
        COUNT(uniq_stations) AS station_count,
        COUNT(good_records) AS good_record_count,
        COUNT(records) AS record_count;
}
DUMP year_stats;

Using the cut UDF we developed earlier, we load various fields from the input dataset into the records relation. Next, we group records by year. Notice the PARALLEL keyword for setting the number of reducers to use; this is vital when running on a cluster.
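As an aside, if you would rather not repeat a reducer count on every statement, Pig also supports a script-wide default via the set default_parallel option (a brief sketch; the value 10 here is just an illustration, not taken from the example above):

```pig
-- Set a default number of reducers for all subsequent
-- reduce-phase operators (GROUP, JOIN, ORDER BY, and so on),
-- instead of writing PARALLEL on each statement.
SET default_parallel 10;

-- This GROUP now runs with 10 reducers, even without PARALLEL.
grouped_records = GROUP records BY year;
```

A PARALLEL clause on an individual statement still takes precedence over the script-wide default.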
Then we process each group using a nested FOREACH...GENERATE operator. The first nested statement creates a relation for the distinct USAF identifiers for stations using the DISTINCT operator. The second nested statement creates a relation for the records with "good" readings using the FILTER operator and a UDF. The final nested statement is a GENERATE statement (a nested FOREACH...GENERATE must always have a GENERATE statement as the last nested statement) that generates the summary fields of interest using the grouped records, as well as the relations created in the nested block.
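When developing a script like this, it can help to confirm the shape of the result before launching the full job; Pig's DESCRIBE operator prints a relation's schema. A sketch of what we would expect here, given that COUNT yields a long and FLATTEN(group) exposes the group key:

```pig
DESCRIBE year_stats;
-- Expected along the lines of:
-- year_stats: {group: int, station_count: long,
--              good_record_count: long, record_count: long}
```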
Running it on a few years' worth of data, we get the following:
(1920,8L,8595L,8595L)
(1950,1988L,8635452L,8641353L)
(1930,121L,89245L,89262L)
(1910,7L,7650L,7650L)
(1940,732L,1052333L,1052976L)