Listing 9.2 Filtering and transformation steps in a Pig Workflow
# Filtering by field value
grunt> WEBPURCHASES = LOAD 'webpurchases.csv' USING PigStorage(',') AS
(user_id:chararray, item:chararray, date:chararray, price:double);
grunt> DUMP WEBPURCHASES;
(1,Item 2,2012-01-14,2.99)
(7,Item 3,2012-01-01,1.99)
(19,Item 2,2012-01-03,2.99)
(19,Item 1,2012-01-14,4.99)
(19,Item 2,2012-01-09,2.99)
# etc ...
grunt> JAN_03_2012_PURCHASES = FILTER WEBPURCHASES BY date == '2012-01-03';
grunt> DUMP JAN_03_2012_PURCHASES;
(19,Item 2,2012-01-03,2.99)
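The listing title also promises transformation steps; one common transformation in Pig Latin is FOREACH ... GENERATE, which projects fields and computes new values. The following sketch is illustrative rather than part of the original listing (the relation name DISCOUNTED and the 10% discount are hypothetical):

# Transformation: project fields and compute a new value
grunt> DISCOUNTED = FOREACH JAN_03_2012_PURCHASES GENERATE
user_id, item, price * 0.9 AS sale_price;
grunt> DUMP DISCOUNTED;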
Running a Pig Script in Batch Mode
Now that we've tested our basic workflow using the Grunt tool, let's place our Pig
Latin statements into a script that we can run from the command line. Remember to
remove diagnostic operators such as DUMP from your script when running in
production, as they will slow down the execution of the script. Store the commands
in a text file, and run the script using the pig command (see Listing 9.3).
Listing 9.3 Run a Pig script with a Hadoop cluster (and data in HDFS)
# Pig commands in a file called "my_workflow.pig"
# Move the local files into HDFS
> hadoop fs -put users.csv
> hadoop fs -put webpurchases.csv
> pig my_workflow.pig
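The script itself would contain the same statements we tested in Grunt, with a STORE operator in place of the DUMP diagnostic so that results are written to HDFS. A sketch of what my_workflow.pig might look like (the output path 'jan_03_purchases' is a hypothetical choice):

# Possible contents of my_workflow.pig
WEBPURCHASES = LOAD 'webpurchases.csv' USING PigStorage(',') AS
(user_id:chararray, item:chararray, date:chararray, price:double);
JAN_03_2012_PURCHASES = FILTER WEBPURCHASES BY date == '2012-01-03';
STORE JAN_03_2012_PURCHASES INTO 'jan_03_purchases' USING PigStorage(',');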
Cascading: Building Robust Data-Workflow Applications
Scripting a workflow using Pig is one approach to building a complex data workflow,
but actually building, testing, and shipping a robust software product is another thing
altogether. Although Pig has many features that allow it to be integrated into existing
applications and even unit-tested, it may make more sense to express workflows in a
language with a more robust development environment, such as Java.
 
 
 