it speeds up your coding and makes it much more maintainable. Pig calls itself a data flow
language, one in which datasets are read in and transformed into other datasets using a
combination of procedural data-flow steps and SQL-like constructs.
Pig is so called because “pigs eat everything,” meaning that Pig can accommodate many dif-
ferent forms of input, though it is frequently used for transforming text datasets. In many ways,
Pig is an admirable extract, transform, and load (ETL) tool. Pig is compiled into MapReduce
code, and the compiler is reasonably well optimized: a series of Pig statements is combined
into as few MapReduce jobs as possible, rather than generating a separate set of mappers and
reducers for each statement and running them sequentially.
There is a library of shared Pig routines available in the Piggy Bank.
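A Piggy Bank routine is pulled into a script by registering the jar and defining a short alias for the UDF class. A minimal sketch follows; the jar path is an assumption that depends on your installation, and Reverse is just one of the Piggy Bank string UDFs:

```
-- register the Piggy Bank jar so its UDFs are visible (path is illustrative)
REGISTER /path/to/piggybank.jar;
-- give the fully qualified class a short alias
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

reviews = load 'reviews.csv' using PigStorage(',')
    as (reviewer:chararray, title:chararray, rating:int);
-- apply the UDF to one column
backwards = foreach reviews generate Reverse(title);
```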
Tutorial Links
There's a fairly complete guide that walks you through the process of installing Pig and writing
your first couple of scripts. “Working with Pig” is a great overview of the Pig technology.
Example Code
The movie review problem can be expressed quickly in Pig with only five lines of code:
-- Read in all the movie reviews and find the average rating for the film Dune
-- the file reviews.csv has lines of the form:
-- name, film_title, rating
reviews = load 'reviews.csv' using PigStorage(',')
    as (reviewer:chararray, title:chararray, rating:int);
-- Only consider reviews of Dune
duneonly = filter reviews by title == 'Dune';
-- we want to use the Pig built-in AVG function, but
-- AVG works on bags, not flat relations; grouping creates the bags
dunebag = group duneonly by title;
-- now generate the average and then dump it
dunescore = foreach dunebag generate AVG(duneonly.rating);
dump dunescore;
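dump prints the result to the console, which is handy for interactive work; in a pipeline you would more often persist it with store. A sketch, where the output directory name is illustrative:

```
-- write the average out as a comma-delimited file instead of dumping it
store dunescore into 'dune_avg' using PigStorage(',');
```

If the statements above are saved as, say, dune.pig alongside reviews.csv, the whole script can be run against your local filesystem with pig -x local dune.pig.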