By grouping the data in this way, we have created a row per year, so now all that remains
is to find the maximum temperature for the tuples in each bag. Before we do this, let's
understand the structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second field
is the same structure as the filtered_records relation that was being grouped. With
this information, we can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>>   MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a GENERATE
clause to define the fields in each derived row. In this example, the first field is group,
which is just the year. The second field is a little more complex. The
filtered_records.temperature reference is to the temperature field of the
filtered_records bag in the grouped_records relation. MAX is a built-in
function for calculating the maximum value of fields in a bag. In this case, it calculates the
maximum temperature for the fields in each filtered_records bag. Let's check the
result:
grunt> DUMP max_temp;
(1949,111)
(1950,22)
We've successfully calculated the maximum temperature for each year.
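To make the data flow concrete, here is a minimal Python sketch of what the GROUP and FOREACH steps do to the tuples. It is only a model of the semantics, not how Pig executes the query. The sample tuples and the helper names group_by_year and max_temperature are assumptions made for illustration; the tuples were chosen to agree with the DUMP output above.

```python
from collections import defaultdict

# Hypothetical (year, temperature, quality) tuples, chosen to agree with the
# DUMP output above; they are an assumption, not data taken from the text.
filtered_records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def group_by_year(records):
    """Model GROUP ... BY year: one row per year, each holding a bag of whole tuples."""
    bags = defaultdict(list)
    for record in records:
        bags[record[0]].append(record)
    # Each row is (group, bag), mirroring the schema reported by DESCRIBE.
    return sorted(bags.items())


def max_temperature(grouped):
    """Model FOREACH ... GENERATE group, MAX(filtered_records.temperature)."""
    return [(group, max(t[1] for t in bag)) for group, bag in grouped]


grouped_records = group_by_year(filtered_records)
max_temp = max_temperature(grouped_records)
for row in max_temp:
    print(row)
```

Note that each bag keeps the complete tuples, including the year, which is why the nested schema repeats all three fields; the projection to temperature happens only inside the MAX call.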
Generating Examples
In this example, we've used a small sample dataset with just a handful of rows to make it
easier to follow the data flow and aid debugging. Creating a cut-down dataset is an art, as
ideally it should be rich enough to cover all the cases to exercise your queries (the
completeness property), yet small enough to make sense to the programmer (the conciseness
property). Using a random sample doesn't work well in general because join and filter
operations tend to remove all random data, leaving an empty result, which is not illustrative
of the general data flow.
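A rough back-of-the-envelope calculation shows why random sampling interacts badly with joins. The sketch below assumes each input is sampled independently and uniformly, so a joinable pair survives only if both of its records are kept. The function name and the sizes used are hypothetical.

```python
def expected_join_matches(matching_pairs, fraction_left, fraction_right):
    """Expected number of joinable pairs left after sampling each input independently.

    A pair survives only if both of its records are sampled, which happens
    with probability fraction_left * fraction_right.
    """
    return matching_pairs * fraction_left * fraction_right


# Hypothetical sizes: 10,000 joinable pairs, with a 1% sample of each input.
survivors = expected_join_matches(10_000, 0.01, 0.01)
print(f"expected surviving pairs: {survivors:.2f}")
```

Under these assumptions only about one pair in ten thousand is expected to survive, so a sample small enough to read easily will often produce an empty join.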
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably
complete and concise sample dataset. Here is the output from running ILLUSTRATE on our
dataset (slightly reformatted to fit the page):