Listing 9.2 Filtering and transformation steps in a Pig Workflow
# Filtering by field value
grunt> WEBPURCHASES = LOAD 'webpurchases.csv' USING PigStorage(',') AS
(user_id:chararray, item:chararray, date:chararray, price:double);
grunt> DUMP WEBPURCHASES;
(1,Item 2,2012-01-14,2.99)
(7,Item 3,2012-01-01,1.99)
(19,Item 2,2012-01-03,2.99)
(19,Item 1,2012-01-14,4.99)
(19,Item 2,2012-01-09,2.99)
# etc ...
grunt> JAN_03_2012_PURCHASES = FILTER WEBPURCHASES BY date == '2012-01-03';
grunt> DUMP JAN_03_2012_PURCHASES;
(19,Item 2,2012-01-03,2.99)
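The listing title also promises transformation steps; one common transformation in Pig Latin is FOREACH ... GENERATE, which projects fields and computes new values. The following sketch is illustrative rather than part of the original listing (the relation name DISCOUNTED and the 10% discount are hypothetical):

# Transformation: project fields and compute a new value
grunt> DISCOUNTED = FOREACH JAN_03_2012_PURCHASES GENERATE
user_id, item, price * 0.9 AS sale_price;
grunt> DUMP DISCOUNTED;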
Running a Pig Script in Batch Mode
Now that we've tested our basic workflow using the Grunt tool, let's place our Pig
Latin statements into a script that we can run from the command line. Remember to
remove diagnostic operators such as DUMP from your script when running in
production, as they will slow down the execution of the script. Store the commands
in a text file, and run the script using the pig command (see Listing 9.3).
Listing 9.3 Run a Pig script with a Hadoop cluster (and data in HDFS)
# Pig commands in a file called "my_workflow.pig"
# Move the local files into HDFS
> hadoop fs -put users.csv
> hadoop fs -put webpurchases.csv
> pig my_workflow.pig
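The script itself would contain the same statements we tested in Grunt, with a STORE operator in place of the DUMP diagnostic so that results are written to HDFS. A sketch of what my_workflow.pig might look like (the output path 'jan_03_purchases' is a hypothetical choice):

# Possible contents of my_workflow.pig
WEBPURCHASES = LOAD 'webpurchases.csv' USING PigStorage(',') AS
(user_id:chararray, item:chararray, date:chararray, price:double);
JAN_03_2012_PURCHASES = FILTER WEBPURCHASES BY date == '2012-01-03';
STORE JAN_03_2012_PURCHASES INTO 'jan_03_purchases' USING PigStorage(',');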
Cascading: Building Robust Data-Workflow Applications
Scripting a workflow using Pig is one approach to building a complex data workflow,
but actually building, testing, and shipping a robust software product is another thing
altogether. Although Pig has many features that allow it to be integrated into existing
applications and even unit-tested, it may make more sense to express workflows in a
language with a more robust development environment, such as Java.
 
 
 