Relations are given names, or aliases, so they can be referred to. This relation is given the
records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation (the relation's schema) using the DESCRIBE
operator on the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality,
which are the names we gave them in the AS clause. The fields have the types given to them
in the AS clause, too. We examine types in Pig in more detail later.
The second statement removes records that have a missing temperature (indicated by a
value of 9999) or an unsatisfactory quality reading. For this small dataset, no records are
filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>>   quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
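As a rough analogy only, the filter's predicate can be mirrored in plain Python. This sketch says nothing about how Pig evaluates the statement; the names MISSING, GOOD_QUALITY, and keep are hypothetical, and the tuples are copied from the DUMP output above.

```python
# Rough Python analogue of the FILTER statement (illustrative only).
# Each tuple is (year, temperature, quality), copied from the DUMP output.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]

MISSING = 9999                  # sentinel value for a missing temperature
GOOD_QUALITY = {0, 1, 4, 5, 9}  # quality codes accepted by the filter


def keep(record):
    """Return True if the record passes both filter conditions."""
    _year, temperature, quality = record
    return temperature != MISSING and quality in GOOD_QUALITY


filtered_records = [r for r in records if keep(r)]

for record in filtered_records:
    print(record)
```

Every sample record has a real temperature and a quality code of 1, so the filtered list matches the input, just as in the Grunt session.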
The third statement uses the GROUP operator to group the filtered_records relation by the
year field. Let's use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
We now have two rows, or tuples: one for each year in the input data. The first field in
each tuple is the field being grouped by (the year), and the second field has a bag of tuples
for that year. A bag is just an unordered collection of tuples, which in Pig Latin is
represented using curly braces.
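To make the shape of the grouped output concrete, here is a minimal Python sketch of the same idea. It is an analogy under stated assumptions, not Pig's implementation: the helper group_by_year is hypothetical, and a Python list stands in for a bag even though a list has an order and a bag does not.

```python
from collections import defaultdict

# Tuples are (year, temperature, quality), copied from the DUMP output above.
filtered_records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def group_by_year(records):
    """Map each year to the collection ("bag") of whole tuples for that year."""
    groups = defaultdict(list)
    for record in records:
        year = record[0]
        groups[year].append(record)  # the full tuple is kept, year included
    return dict(groups)


grouped_records = group_by_year(filtered_records)

# One (key, bag) pair per distinct year.
for year in sorted(grouped_records):
    print((year, grouped_records[year]))
```

As in the Pig output, each group holds the complete original tuples, so the year appears both as the grouping key and inside every tuple in the bag.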