Effective Big Data ETL with SSIS, Pig, and Sqoop - Microsoft Big Data Solutions

When you need to operate on columns, you can use the FOREACH operator. It suits data like that shown here because it evaluates the GENERATE expressions once for each row of the relation. After a GROUP, each row holds one group key and the bag of records that share that key. If you want to produce an average totalpurchaseamount for each city, you can use the following statement:

averaged = FOREACH grouped GENERATE group,
    AVG(filtered.totalpurchaseamount);
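The reference to filtered.totalpurchaseamount is easier to follow once you see the schema that GROUP produces. The following is a minimal sketch, assuming the grouped and averaged relations defined in this example already exist in the session. It only inspects schemas and does not change any data.

```pig
-- Print the schema of the grouped relation: one row per city, with the
-- key in a field named "group" and a bag named "filtered" that holds
-- every input row for that city.
DESCRIBE grouped;

-- After FOREACH ... GENERATE, each row is reduced to the group key plus
-- a single averaged value.
DESCRIBE averaged;
```

Because the bag takes the name of the relation that was grouped, AVG is applied to filtered.totalpurchaseamount rather than to a bare column name.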

To order the results, you can use the ORDER operator. In this case, the $1 indicates that the statement is using the ordinal column position, rather than addressing it by name. Positions are zero-based, so $0 is the group key (the city) and $1 is the computed average:

ordered = ORDER averaged BY $1 DESC;
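Positional references are compact, but they are easy to miscount. As an alternative, the sketch below names the generated columns with AS and sorts by name. It assumes the grouped relation from this example; the aliases city and avgpurchase are illustrative and do not appear in the original statements.

```pig
-- Name the generated columns so later statements need not count positions.
-- "city" and "avgpurchase" are illustrative aliases.
averaged = FOREACH grouped GENERATE
    group AS city,
    AVG(filtered.totalpurchaseamount) AS avgpurchase;

-- Sort by the named column, largest average first.
ordered = ORDER averaged BY avgpurchase DESC;
```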

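To store the results, you can call the STORE operator. This lets you write the values back to Hadoop using the PigStorage() function, which produces tab-delimited text when called with no arguments. Pig treats the INTO target as an output directory rather than a single file:

STORE ordered INTO 'c:\SampleData\PigOutput.txt' USING
    PigStorage();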
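If a downstream tool expects a different delimiter, PigStorage accepts one as an argument. The following is a sketch only: the output path is a hypothetical HDFS directory chosen for illustration, and the ordered relation is assumed to exist from the preceding statement.

```pig
-- Write comma-separated output instead of the default tab-separated text.
-- The target is a hypothetical directory; Pig creates it and reports an
-- error if it already exists.
STORE ordered INTO '/MsBigData/Output/AvgPurchaseByCity'
    USING PigStorage(',');
```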

If you take this entire set of statements together, you can see that Pig Latin is relatively easy to read and understand. These statements could be saved to a file as a Pig script and then executed as a batch:

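-- Load the customer records (tab-delimited by default).
source = LOAD '/MsBigData/Customer/' USING PigStorage()
    AS (name, city, state,
        postalcode, totalpurchaseamount);
-- Keep only Florida customers; == tests for equality.
filtered = FILTER source BY state == 'FL';
grouped = GROUP filtered BY city;
-- One output row per city: the group key and the average purchase.
averaged = FOREACH grouped GENERATE group,
    AVG(filtered.totalpurchaseamount);
-- $0 is the city and $1 is the average.
ordered = ORDER averaged BY $1 DESC;
STORE ordered INTO 'c:\SampleData\PigOutput.txt' USING
    PigStorage();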
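The script declares its columns without types, so Pig loads them as bytearray and casts them where an operation needs it. One possible refinement is to declare the types in the LOAD statement. The types below are assumptions about the customer data, not something the original script states.

```pig
-- Declare column types up front (assumed types, for illustration).
source = LOAD '/MsBigData/Customer/' USING PigStorage()
    AS (name:chararray, city:chararray, state:chararray,
        postalcode:chararray, totalpurchaseamount:double);

-- Confirm the schema Pig will use for the later statements.
DESCRIBE source;
```

Keeping postalcode as chararray preserves leading zeros, and declaring totalpurchaseamount as double makes the input type of AVG explicit.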
