Pig - Hadoop: The Definitive Guide


Relations are given names, or aliases, so they can be referred to. This relation is given the records alias. We can examine the contents of an alias using the DUMP operator:

grunt> DUMP records;

(1950,0,1)

(1950,22,1)

(1950,-11,1)

(1949,111,1)

(1949,78,1)

We can also see the structure of a relation (its schema) using the DESCRIBE operator on the relation's alias:

grunt> DESCRIBE records;

records: {year: chararray,temperature: int,quality: int}

This tells us that records has three fields, with aliases year, temperature, and quality, which are the names we gave them in the AS clause. The fields have the types given to them in the AS clause, too. We examine types in Pig in more detail later.
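To make the idea of a relation with a schema concrete, here is a minimal sketch in plain Python (not Pig Latin). It models records as a list of typed tuples and rebuilds the DESCRIBE line from the field names and types. The `Record` class, the `describe` helper, and the `PIG_TYPE_NAMES` mapping are illustrative assumptions, not part of Pig.

```python
# Plain-Python model of the `records` relation; this is not Pig code.
from typing import NamedTuple, get_type_hints


class Record(NamedTuple):
    year: str          # chararray in the Pig schema
    temperature: int
    quality: int


# The five tuples shown by DUMP records.
records = [
    Record("1950", 0, 1),
    Record("1950", 22, 1),
    Record("1950", -11, 1),
    Record("1949", 111, 1),
    Record("1949", 78, 1),
]

# Hypothetical mapping from Python types to Pig type names, for display only.
PIG_TYPE_NAMES = {str: "chararray", int: "int"}


def describe(alias, record_type):
    """Format a schema line in the style of Pig's DESCRIBE output."""
    fields = ",".join(
        f"{name}: {PIG_TYPE_NAMES[field_type]}"
        for name, field_type in get_type_hints(record_type).items()
    )
    return f"{alias}: {{{fields}}}"


print(describe("records", Record))
```

The field names play the role of the aliases from the AS clause, and the annotations play the role of the declared types.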

The second statement removes records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality reading. For this small dataset, no records are filtered out:

grunt> filtered_records = FILTER records BY temperature != 9999 AND

>> quality IN (0, 1, 4, 5, 9);

grunt> DUMP filtered_records;

(1950,0,1)

(1950,22,1)

(1950,-11,1)

(1949,111,1)

(1949,78,1)
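The FILTER condition can be read as a predicate applied to each tuple in turn. The following plain-Python sketch (an illustration, not how Pig evaluates the statement) applies the same test to the sample data. The two rows in `bad_examples` are made up here to show what would be rejected; they are not in the dataset.

```python
# Plain-Python model of the FILTER step; this is not Pig code.
MISSING_TEMPERATURE = 9999
GOOD_QUALITY_CODES = {0, 1, 4, 5, 9}

# (year, temperature, quality) tuples from DUMP records.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def keep(record):
    """Mirror `temperature != 9999 AND quality IN (0, 1, 4, 5, 9)`."""
    _year, temperature, quality = record
    return temperature != MISSING_TEMPERATURE and quality in GOOD_QUALITY_CODES


filtered_records = [r for r in records if keep(r)]

# Hypothetical rows, invented for illustration only.
bad_examples = [
    ("1950", 9999, 1),  # missing temperature
    ("1949", 30, 2),    # quality code not in the accepted set
]
rejected = [r for r in bad_examples if not keep(r)]

print(filtered_records == records)  # every sample row passes the predicate
print(len(rejected))                # both invented rows are rejected
```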

The third statement uses the GROUP operator to group the filtered_records relation by the year field. Let's use DUMP to see what it produces:

grunt> grouped_records = GROUP filtered_records BY year;

grunt> DUMP grouped_records;

(1949,{(1949,78,1),(1949,111,1)})

(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})

We now have two rows, or tuples: one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field has a bag of tuples for that year. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.
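The shape of the grouped output can be sketched in plain Python as pairs of (grouping key, bag). This is a rough model only: the `group_by_year` helper is a name chosen here, and a Python list stands in for the bag even though a real bag has no defined order.

```python
# Plain-Python model of GROUP ... BY year; this is not Pig code.
from collections import defaultdict

# (year, temperature, quality) tuples from DUMP filtered_records.
filtered_records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def group_by_year(relation):
    """Return one (year, bag) pair per distinct year.

    Each bag holds the complete original tuples, as in Pig's output.
    """
    bags = defaultdict(list)
    for record in relation:
        bags[record[0]].append(record)
    # Sort by key only to make the printed result deterministic.
    return sorted(bags.items())


grouped_records = group_by_year(filtered_records)

for year, bag in grouped_records:
    print(year, bag)
```

Because a bag is unordered, the order of tuples inside each list carries no meaning; the DUMP output above happens to list them in a different order than the input.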
