time (in parallel). Pig provides an abstraction that allows users of Hadoop to express parallel MapReduce tasks as a series of high-level, step-by-step tasks. Pig can be thought of as having a procedural model, which tends to match the mental model of how we think about data workflows. This allows users to express the steps of a transformation without worrying about how the sausage gets made, so to speak.
Although a plain MapReduce job passes data around as a series of key-value pairs, Pig abstracts data workflows by approaching individual records as tuples, or ordered lists of fields. A collection of Pig tuples is referred to as a bag. Once source data is split up into a collection of tuples and loaded into a bag, it becomes known as a "relation." This abstraction makes it easy to write procedural commands. In other words, Pig allows you to think of a data transformation at a very human level: "Take this bag of data, process it, and place the results in this other bag."
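As a rough sketch of this bag-in, bag-out style (the file name and fields here are hypothetical), a minimal Pig Latin transformation might look like the following:

-- Load raw records into a relation (a bag of tuples)
raw_logs = LOAD 'logs.tsv' AS (user:chararray, bytes:long);
-- Process the bag, keeping only the large responses
big_logs = FILTER raw_logs BY bytes > 1000000;
-- Place the results in another "bag" on disk
STORE big_logs INTO 'big_logs_output';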
Hadoop is designed to be used across a distributed network of many machines. When testing data-workflow scripts, running across a live cluster can create an additional level of complexity. Like other Hadoop-based tools, it's possible to run Pig on a single machine, in what is known as local mode. It's always a good idea to test data-transformation scripts locally before attempting a larger deployment, as Pig will run the same way whether it is run locally or on a cluster.
Pig aims eventually to run on top of a number of execution frameworks, but for now installing Pig requires a recent version of Java and at least a single machine running Hadoop. For writing and testing Pig scripts on a workstation, it is possible to run Hadoop in "local" mode.
Running Pig Using the Interactive Grunt Shell
Pig commands can be run using the built-in interactive shell called Grunt. Grunt is useful for testing the individual steps in a Pig workflow and for displaying the results of each step at different points in the process. The Grunt shell can be invoked by typing pig on the command line, but by default the shell will assume that you want to run jobs on your Hadoop cluster and that your input data is in HDFS. To run Grunt on a local machine using input files on the local filesystem, use the -x flag (see Listing 9.1).
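For example, starting Grunt in local mode from the command line looks like this:

$ pig -x local
grunt>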
To demonstrate the basics of using Pig, let's first create a workflow that joins data from two CSV files by means of a particular key. Our examples use the diagnostic command DUMP to show the result of our workflow at given intervals. Although useful for debugging, it's not a good idea to use the DUMP command in a production script, as it will reduce performance and prevent certain types of optimizations from being used. Also, when testing workflow scripts locally, don't forget to start with small samples of data rather than a full, massive dataset.
Pig loads data using the PigStorage module. By default, PigStorage treats the fields within a record as separated by tabs. In order to load data with a different delimiter (such as a comma), pass the desired delimiter character to PigStorage when loading the data.
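As a sketch of what that looks like (the file names, field names, and join key below are hypothetical), loading two comma-delimited files and joining them on a shared key might be written as follows:

-- Load each CSV file, telling PigStorage to split fields on commas
customers = LOAD 'customers.csv' USING PigStorage(',')
    AS (customer_id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',')
    AS (order_id:int, customer_id:int, total:double);
-- Join the two relations on the shared customer_id field
joined = JOIN customers BY customer_id, orders BY customer_id;
-- Handy while testing, but avoid DUMP in production scripts
DUMP joined;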