time (in parallel). Pig provides an abstraction that allows users of Hadoop to express parallel MapReduce tasks as a series of high-level, step-by-step tasks. Pig can be thought of as having a procedural model, which tends to match the mental model of how we think about data workflows. This allows users to express the steps of a transformation without worrying about how the sausage gets made, so to speak.
Although a plain MapReduce job passes data around as a series of key-value pairs, Pig abstracts data workflows by approaching individual records as tuples, or ordered lists of fields. A collection of Pig tuples is referred to as a bag. Once source data is split up into a collection of tuples and loaded into a bag, it becomes known as a "relation." This abstraction makes it easy to write procedural commands. In other words, Pig allows you to think of a data transformation at a very human level: "Take this bag of data, process it, and place the results in this other bag."
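As a rough sketch of this bag-in, bag-out style (the file name and fields here are hypothetical), a minimal Pig Latin transformation might look like the following:

-- Load raw records into a relation (a bag of tuples)
raw_logs = LOAD 'logs.tsv' AS (user:chararray, bytes:long);
-- Process the bag, keeping only the large responses
big_logs = FILTER raw_logs BY bytes > 1000000;
-- Place the results in another "bag" on disk
STORE big_logs INTO 'big_logs_output';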
Hadoop is designed to be used across a distributed network of many machines. When testing data-workflow scripts, running across a live cluster can create an additional level of complexity. Like other Hadoop-based tools, it's possible to run Pig on a single machine, in what is known as local mode. It's always a good idea to test data-transformation scripts locally before attempting a larger deployment, as Pig will run the same way whether it is run locally or on a cluster.
Pig aims eventually to run on top of a number of execution frameworks, but for now installing Pig requires a recent version of Java and at least a single machine running Hadoop. For writing and testing Pig scripts on a workstation, it is possible to run Hadoop in "local" mode.
Running Pig Using the Interactive Grunt Shell
Pig commands can be run using the built-in interactive shell called Grunt. Grunt is useful for testing the individual steps in a Pig workflow and for displaying the results of each step at different points in the process. The Grunt shell can be invoked by typing pig on the command line, but by default the shell will assume that you want to run jobs on your Hadoop cluster and that your input data is in HDFS. To run Grunt on a local machine using input files on the local filesystem, use the -x flag (see Listing 9.1).
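For example, starting Grunt in local mode from the command line looks like this:

$ pig -x local
grunt>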
To demonstrate the basics of using Pig, let's first create a workflow that joins data from two CSV files by means of a particular key. Our examples use the diagnostic command DUMP to show the result of our workflow at given intervals. Although useful for debugging, it's not a good idea to use the DUMP command in a production script, as it will reduce performance and prevent certain types of optimizations from being used. Also, when testing workflow scripts locally, don't forget to start with small samples of data rather than a full, massive dataset.
Pig loads data using the PigStorage module. By default, PigStorage treats the fields within a record as separated by tabs. In order to load data with a different delimiter (such as a comma), pass the desired delimiter character to PigStorage when loading the data.
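As a sketch of what that looks like (the file names, field names, and join key below are hypothetical), loading two comma-delimited files and joining them on a shared key might be written as follows:

-- Load each CSV file, telling PigStorage to split fields on commas
customers = LOAD 'customers.csv' USING PigStorage(',')
    AS (customer_id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',')
    AS (order_id:int, customer_id:int, total:double);
-- Join the two relations on the shared customer_id field
joined = JOIN customers BY customer_id, orders BY customer_id;
-- Handy while testing, but avoid DUMP in production scripts
DUMP joined;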