Listing 9.1 Load and run Pig commands using the grunt shell
# users.csv
# Information about individual users
# user_id, user_name, date_joined, state
1,Michael Manoochehri,2010-03-12,California
7,Paul Pigstun,2010-03-12,Washington
19,Haddie Doop,2010-03-12,Maine
41,Bar FooData,2010-03-12,New York
# etc...
# webpurchases.csv
# List of online purchases by user id
# user_id, item_name, date, price
1,Item 2,2012-01-14,2.99
7,Item 3,2012-01-01,1.99
19,Item 2,2012-01-03,2.99
19,Item 3,2012-02-20,1.99
41,Item 2,2012-01-14,2.99
# etc...
# Start Pig interactive shell in "local" mode
> pig -x local
grunt> USERS = LOAD 'users.csv' USING PigStorage(',') AS
(user_id:chararray, user_name:chararray, date_joined:chararray,
state:chararray);
grunt> WEBPURCHASES = LOAD 'webpurchases.csv' USING PigStorage(',') AS
(user_id:chararray, item:chararray, date:chararray, price:double);
grunt> WEB_BUYERS = JOIN USERS BY user_id, WEBPURCHASES BY user_id;
grunt> DUMP WEB_BUYERS;
(1,Michael Manoochehri,2010-03-12,California,1,Item 2,2012-01-14,2.99)
(7,Paul Pigstun,2010-03-12,Washington,7,Item 3,2012-01-01,1.99)
(19,Haddie Doop,2010-03-12,Maine,19,Item 2,2012-01-03,2.99)
(19,Haddie Doop,2010-03-12,Maine,19,Item 1,2012-01-14,4.99)
(19,Haddie Doop,2010-03-12,Maine,19,Item 2,2012-01-09,2.99)
(19,Haddie Doop,2010-03-12,Maine,19,Item 3,2012-02-20,1.99)
(41,Bar FooData,2010-03-12,New York,41,Item 2,2012-01-14,2.99)
# etc...
Filtering and Optimizing Data Workflows
When working with large datasets, a common requirement is to extract a smaller set
of records that match a particular condition. Pig provides a FILTER operator that can be
used to either select or remove records. Listing 9.2 shows an example of using Pig to
return records that match a particular date string.
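
As a rough sketch of the idea (separate from Listing 9.2, and assuming the WEBPURCHASES relation from Listing 9.1 has already been loaded in the grunt shell), FILTER statements might look like the following. The relation names JAN_14_PURCHASES, JANUARY_PURCHASES, and OTHER_PURCHASES and the chosen date values are illustrative assumptions, not names from the original text.

```pig
-- Assumes WEBPURCHASES was loaded as in Listing 9.1, with date stored as a chararray.

-- Select: keep only purchases whose date string equals one exact value
JAN_14_PURCHASES = FILTER WEBPURCHASES BY date == '2012-01-14';

-- Select by pattern: MATCHES takes a regular expression that must match the whole string
JANUARY_PURCHASES = FILTER WEBPURCHASES BY date MATCHES '2012-01-.*';

-- Remove: NOT inverts the condition, so the January 2012 records are dropped instead
OTHER_PURCHASES = FILTER WEBPURCHASES BY NOT (date MATCHES '2012-01-.*');

-- Print one of the filtered relations to the console
DUMP JAN_14_PURCHASES;
```

These statements can be typed at the grunt> prompt or saved in a script file. Applying a filter before a JOIN generally reduces the amount of data that later steps have to process.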
 
 