Pig - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

By grouping the data in this way, we have created a row per year, so now all that remains

is to find the maximum temperature for the tuples in each bag. Before we do this, let's un-

derstand the structure of the grouped_records relation:

grunt> DESCRIBE grouped_records;

grouped_records: {group: chararray,filtered_records: {year:

chararray,

temperature: int,quality: int}}

This tells us that the grouping field is given the alias group by Pig, and the second field

is the same structure as the filtered_records relation that was being grouped. With

this information, we can try the fourth transformation:

grunt> max_temp = FOREACH grouped_records GENERATE group,

>>

MAX(filtered_records.temperature);

FOREACH processes every row to generate a derived set of rows, using a GENERATE

clause to define the fields in each derived row. In this example, the first field is group ,

which is just the year. The second field is a little more complex. The

filtered_records.temperature reference is to the temperature field of the

filtered_records bag in the grouped_records relation. MAX is a built-in func-

tion for calculating the maximum value of fields in a bag. In this case, it calculates the

maximum temperature for the fields in each filtered_records bag. Let's check the

result:

grunt> DUMP max_temp;

(1949,111)

(1950,22)

We've successfully calculated the maximum temperature for each year.

Generating Examples

In this example, we've used a small sample dataset with just a handful of rows to make it

easier to follow the data flow and aid debugging. Creating a cut-down dataset is an art, as

ideally it should be rich enough to cover all the cases to exercise your queries (the com-

pleteness property), yet small enough to make sense to the programmer (the conciseness

property). Using a random sample doesn't work well in general because join and filter op-

erations tend to remove all random data, leaving an empty result, which is not illustrative

of the general data flow.

With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably com-

plete and concise sample dataset. Here is the output from running ILLUSTRATE on our

dataset (slightly reformatted to fit the page):

Search WWH ::

Custom Search

Home