Extending Pipe Assemblies - Enterprise Data Workflows with Cascading

$ pig -version
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.10.0 (r1328203)
compiled Apr 19 2012, 22:54:12
$ pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p \
  stopPath=./data/en.stop ./src/scripts/wc.pig

The output from this Pig script should match the output from the Cascading sample app. To be fair, Pig supports a replicated join, which is not shown here. We tried to get it working, but ran into bugs.
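
For readers unfamiliar with the term, a replicated join copies the smaller relation into memory on every map task, so each record from the larger relation can be matched locally without a reduce-side shuffle. The sketch below illustrates the idea in plain Java. It is not Pig or Cascading API, the class and method names are hypothetical, and it assumes the join in question pairs tokens against a small stop-word list, as the stopPath parameter suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: plain Java, not Pig or Cascading API.
class ReplicatedJoinSketch {
    // The small relation (stop words) is held entirely in memory, which is
    // what "replicated" means: every task gets its own full copy.
    // Tokens with no match in that copy are kept, the same effect as a
    // left outer join followed by a filter on unmatched rows.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from rain.txt or en.stop.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "the", "of"));
        List<String> tokens = Arrays.asList("the", "rain", "shadow", "of", "a", "mountain");
        System.out.println(removeStopWords(tokens, stopWords)); // prints [rain, shadow, mountain]
    }
}
```

The approach only pays off when the small side fits comfortably in memory on each task, which a stop-word list normally does.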

Notice that the Pig source is reasonably similar to Cascading, and even a bit more compact. There are sources and sinks defined, tuple schemes, pipe assemblies, joins, functions, regex filters, aggregations, etc. Also, the EXPLAIN on the last line generates a flow diagram, which will be in the dot/wc_pig.dot file after the script runs.

Apache Pig is a data manipulation language (DML), which provides a query algebra atop Hadoop. It is easy to pick up and is generally considered to have a gentler learning curve than Cascading, especially for people who are analysts rather than J2EE developers. An interactive prompt called Grunt makes it simple to prototype apps. Also, Pig can be extended by writing user-defined functions (UDFs) in Java or other languages.

Some drawbacks may be encountered when using Pig for complex apps, particularly in Enterprise IT environments. Extensions via UDFs must be coded and built outside of the Pig Latin language. Similarly, integration of apps outside the context of Apache Hadoop generally requires other coding outside of the scripting language. Business logic must cross multiple language boundaries. This makes it increasingly difficult to troubleshoot code, optimize query plans, audit schema use, handle exceptions, set notifications, track data provenance, etc.

Also note that the LOAD and STORE statements use string literals to reference command-line arguments. These are analogous to taps in Cascading, except that in Pig the compiler cannot catch errors in them: they surface only at runtime, which is problematic given that potentially expensive resources on the cluster are already being consumed. Using string literals for business logic tends to limit testability in general.
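
One way to soften the cost of such late failures is to check the parameters before anything is submitted to the cluster. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, it is not part of Pig or Cascading, and it only checks paths on the local filesystem, as in the sample command above. Paths on HDFS would need a different check.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check; not part of Pig or Cascading.
class PreflightCheck {
    // Returns the names of input parameters that are absent, blank,
    // or that point at a local path which does not exist.
    static List<String> missingInputs(Map<String, String> params, List<String> inputNames) {
        List<String> problems = new ArrayList<>();
        for (String name : inputNames) {
            String value = params.get(name);
            if (value == null || value.trim().isEmpty() || !Files.exists(Paths.get(value))) {
                problems.add(name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Same parameter names and values as the sample command line.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("docPath", "./data/rain.txt");
        params.put("stopPath", "./data/en.stop");

        List<String> problems = missingInputs(params, Arrays.asList("docPath", "stopPath"));
        if (problems.isEmpty()) {
            System.out.println("inputs look fine; safe to launch the script");
        } else {
            System.out.println("fix these parameters before launching: " + problems);
        }
    }
}
```

A check like this catches a mistyped path in milliseconds on the client, rather than after a job has been scheduled.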

Another issue is much more nuanced: in Pig, the logical plan for a query is conflated with its physical plan. This implies a nondeterministic aspect to Pig's executions, because the number of maps and reduces may change unexpectedly as the data changes. This limits the ability to collect app history in “apples-to-apples” comparisons across different runs as your production data changes.
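
To make the effect concrete, consider a size-based heuristic of the kind Pig documents for choosing the number of reducers when none is set explicitly: roughly one reducer per fixed chunk of input, up to a cap. The sketch below is plain Java, not Pig source code. The class name is hypothetical, and the two constants are assumed to mirror the documented defaults of pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max, which may differ by version and configuration.

```java
// Hypothetical illustration of a size-based reducer estimate; not Pig source code.
class ReducerEstimate {
    // Assumed defaults; check the configuration of the Pig version in use.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    // One reducer per BYTES_PER_REDUCER of input, rounded up,
    // never fewer than 1 and never more than MAX_REDUCERS.
    static int estimate(long totalInputBytes) {
        long needed = (totalInputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.max(1, Math.min(MAX_REDUCERS, needed));
    }

    public static void main(String[] args) {
        // Made-up input sizes showing how the plan shifts as data grows.
        long[] inputSizes = {500_000_000L, 5_000_000_000L, 50_000_000_000L};
        for (long size : inputSizes) {
            System.out.println(size + " bytes -> " + estimate(size) + " reducer(s)");
        }
    }
}
```

Under these assumptions the same script runs with 1, 5, or 50 reducers depending only on how much data arrived that day, which is why run-to-run comparisons become hard to interpret.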

In short, simple problems are simple to do in Pig; hard problems become quite complex.

For organizations that tend toward the “conservatism” end of a spectrum for programming environments, these issues with Pig increase risk at scale. Yahoo! has been able to
