Chapter 16. Pig
Apache Pig raises the level of abstraction for processing large datasets. MapReduce allows you, as the programmer, to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge. With Pig, the data structures are much richer, typically being multivalued and nested, and the transformations you can apply to the data are much more powerful. They include joins, for example, which are not for the faint of heart in MapReduce.
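To give a hedged illustration, a join that would take considerable effort to write in raw MapReduce is a single statement in Pig Latin (the relation and key names here are purely illustrative):

    -- Join two hypothetical relations on a shared key
    joined = JOIN orders BY customer_id, customers BY id;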
Pig is made up of two pieces:
▪ The language used to express data flows, called Pig Latin.
▪ The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster (invoking each is shown below).
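As a sketch, the environment is chosen when Pig is invoked; the script name here is illustrative:

    # Local mode: run the script in a single JVM
    pig -x local max_temp.pig

    # MapReduce mode (the default): run on a Hadoop cluster
    pig max_temp.pig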
A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Taken as a whole, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs. Under the covers, Pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows you to focus on the data rather than the nature of the execution.
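For a flavor of what such a data flow looks like, here is a minimal sketch; the file, relation, and field names are illustrative rather than taken from this chapter:

    -- Load tab-delimited records and name the fields
    records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int, quality:int);
    -- Keep only valid readings
    good = FILTER records BY temperature != 9999 AND quality == 0;
    -- Group by year and find each year's maximum temperature
    grouped = GROUP good BY year;
    max_temp = FOREACH grouped GENERATE group, MAX(good.temperature);
    DUMP max_temp;

Each statement defines a new relation in terms of earlier ones; only when a command such as DUMP (or STORE) is reached does Pig compile the whole flow and run it.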
Pig is a scripting language for exploring large datasets. One criticism of MapReduce is that the development cycle is very long. Writing the mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is a time-consuming business, and even with Streaming, which removes the compile and package step, the experience is still involved. Pig's sweet spot is its ability to process terabytes of data in response to a half-dozen lines of Pig Latin issued from the console. Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there. Pig is very supportive of a programmer writing a query, since it provides several commands for introspecting the data structures in your program as it is written. Even more useful, it can perform a sample run on a representative subset of your input data, so you can see whether there are errors in the processing before unleashing it on the full dataset.
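As a sketch of this kind of introspection in Grunt, Pig's interactive shell (using the illustrative relation from above):

    -- DESCRIBE prints the schema of a relation
    grunt> DESCRIBE good;
    -- ILLUSTRATE runs the flow on a small, representative sample of the
    -- input and shows the data at each step
    grunt> ILLUSTRATE good;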
Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by user-defined functions (UDFs). These functions operate on Pig's nested data model, so they can integrate deeply with Pig's operators.
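As a brief, hypothetical sketch of how a UDF is wired into a script, assuming a jar named myudfs.jar that contains a filter function com.example.IsGoodQuality (both names are invented for illustration):

    -- Make the jar's classes available to the script
    REGISTER myudfs.jar;
    -- Call the user-defined function by its fully qualified class name
    good = FILTER records BY com.example.IsGoodQuality(quality);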