$ pig -version
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.10.0 (r1328203)
compiled Apr 19 2012, 22:54:12
$ pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p \
  stopPath=./data/en.stop ./src/scripts/wc.pig
Output from this Pig script should be the same as the output from the Cascading sample
app. To be fair, Pig has support for a replicated join, which is not shown here. We tried
to get it working, but there were bugs.
Notice that the Pig source is reasonably similar to Cascading, and even a bit more
compact. There are sources and sinks defined, tuple schemes, pipe assemblies, joins,
functions, regex filters, aggregations, etc. Also, the EXPLAIN at the last line generates
a flow diagram, which will be in the dot/wc_pig.dot file after the script runs.
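To view that flow diagram, the DOT file can be rendered with Graphviz. The following is a minimal sketch, assuming the Graphviz dot command is installed. The render_plan helper and the default PNG file name are hypothetical and are not part of the sample app.

```shell
#!/bin/sh
# Hypothetical helper: render a DOT file to PNG with Graphviz.
# Assumes Graphviz is installed; not part of the sample app.
render_plan() {
  dot_file="$1"
  # Default output name: same path with .dot replaced by .png.
  png_file="${2:-${dot_file%.dot}.png}"

  # Fail early if the Pig script has not produced the DOT file yet.
  if [ ! -f "$dot_file" ]; then
    echo "missing DOT file: $dot_file" >&2
    return 1
  fi

  # Fail with a distinct status if Graphviz is not available.
  if ! command -v dot >/dev/null 2>&1; then
    echo "Graphviz 'dot' not found on PATH" >&2
    return 2
  fi

  dot -Tpng "$dot_file" -o "$png_file"
}
```

After the script runs, calling render_plan dot/wc_pig.dot would write dot/wc_pig.png next to the DOT file.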
Apache Pig is a data manipulation language (DML), which provides a query algebra
atop Hadoop. It is easy to pick up and generally considered to have less of a learning
curve when compared with Cascading—especially for people who are analysts, not
J2EE developers. An interactive prompt called Grunt makes it simple to prototype apps.
Also, Pig can be extended by writing user-defined functions in Java or other languages.
Some drawbacks may be encountered when using Pig for complex apps, particularly in
Enterprise IT environments. Extensions via UDFs must be coded and built outside of
the Pig Latin language. Similarly, integration of apps outside the context of Apache
Hadoop generally requires other coding outside of the scripting language. Business logic
must cross multiple language boundaries. This makes it increasingly difficult to
troubleshoot code, optimize query plans, audit schema use, handle exceptions, set
notifications, track data provenance, etc.
Also note that the LOAD and STORE statements use string literals to reference command-
line arguments. These are analogous to taps in Cascading, except that in Pig the compiler
won't be able to catch errors until runtime—which is problematic given that potentially
expensive resources on the cluster are already being consumed. Using string literals for
business logic tends to limit testability in general.
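One way to soften this problem is to validate the command-line arguments before launching the job. The sketch below is only an illustration under stated assumptions: the check_wc_params function is hypothetical, it assumes the three paths refer to the local filesystem (paths in HDFS would need a different check), and it treats an existing output location as an error.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the three parameters passed with -p.
# Assumes local filesystem paths; HDFS paths would need another check.
check_wc_params() {
  doc_path="$1"
  stop_path="$2"
  wc_path="$3"

  # Both inputs must exist and be readable.
  for input in "$doc_path" "$stop_path"; do
    if [ ! -r "$input" ]; then
      echo "input not readable: $input" >&2
      return 1
    fi
  done

  # This sketch refuses to proceed if the output location already exists.
  if [ -e "$wc_path" ]; then
    echo "output already exists: $wc_path" >&2
    return 2
  fi

  return 0
}

# Launch Pig only when the checks pass, for example:
# check_wc_params ./data/rain.txt ./data/en.stop ./output/wc &&
#   pig -p docPath=./data/rain.txt -p wcPath=./output/wc \
#       -p stopPath=./data/en.stop ./src/scripts/wc.pig
```

A check like this catches only missing or misnamed paths. It does not validate the schema or the script itself, so it narrows the runtime-error window and does not close it.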
Another issue is much more nuanced: in Pig, the logical plan for a query is conflated
with its physical plan. This implies a nondeterministic aspect to Pig's executions, because
the number of maps and reduces may change unexpectedly as the data changes. This
limits the ability to collect app history in “apples-to-apples” comparisons across different
runs as your production data changes.
In short, simple problems are simple to do in Pig; hard problems become quite complex.
For organizations that tend toward the “conservatism” end of a spectrum for
programming environments, these issues with Pig increase risk at scale. Yahoo! has been able to