Relations are given names, or aliases, so they can be referred to. This relation is given the
records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation — the relation's schema — using the
DESCRIBE operator on the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and
quality, which are the names we gave them in the AS clause. The fields have the types
given to them in the AS clause, too. We examine types in Pig in more detail later.
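As a reminder of where this schema comes from, a first statement of roughly the following shape would produce it. This is only a sketch: the input path is hypothetical, and the file is assumed to be tab-delimited, which is what LOAD expects when no USING clause is given.

```pig
-- Hypothetical input path; the file is assumed to hold tab-separated
-- year, temperature, and quality values, one record per line.
records = LOAD 'input/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

-- Print the schema declared in the AS clause.
DESCRIBE records;
```

Changing a name or type in the AS clause changes what DESCRIBE reports, since the schema is taken directly from that clause.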
The second statement removes records that have a missing temperature (indicated by a
value of 9999) or an unsatisfactory quality reading. For this small dataset, no records are
filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
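The IN operator is, to the best of our knowledge, only available in more recent Pig releases (0.12 onward). On an older release, the same filter can be sketched with explicit equality tests joined by OR. It assumes the records relation and its fields as defined above:

```pig
-- Equivalent filter without IN: keep a record only if its temperature
-- is present (not 9999) and its quality code is one of 0, 1, 4, 5, 9.
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR
     quality == 5 OR quality == 9);
DUMP filtered_records;
```

The parentheses around the OR chain matter: without them, AND would bind more tightly than OR and the temperature test would apply only to the first quality comparison.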
The third statement uses the GROUP operator to group the filtered_records relation by the
year field. Let's use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
We now have two rows, or tuples: one for each year in the input data. The first field in
each tuple is the field being grouped by (the year), and the second field has a bag of tuples
for that year. A bag is just an unordered collection of tuples, which in Pig Latin is
represented using curly braces.
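To see how a bag is typically consumed, here is a minimal sketch that applies an aggregate function to each group. The alias max_temp is our own choice for illustration. Within each tuple of grouped_records, the bag is named after the relation that was grouped (filtered_records), and group refers to the grouping key:

```pig
-- For each group, emit the grouping key and the maximum temperature
-- found in that group's bag of tuples.
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;
```

For the sample data above, this should yield one tuple per year: (1949,111) and (1950,22). The expression filtered_records.temperature projects the temperature field out of every tuple in the bag, and MAX then reduces that bag of values to a single number.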