it speeds up your coding and makes it much more maintainable. Pig calls itself a data flow
language, one in which datasets are read in and transformed into other datasets using a
combination of procedural data-flow steps and SQL-like constructs.
Pig is so called because “pigs eat everything,” meaning that Pig can accommodate many dif-
ferent forms of input, though it is frequently used for transforming text datasets. In many ways,
Pig is an admirable extract, transform, and load (ETL) tool. Pig is compiled into MapReduce
code, and the compiler is reasonably well optimized: a series of Pig statements is combined
into as few MapReduce jobs as possible, rather than generating a separate set of mappers and
reducers for each statement and running them sequentially.
There is a library of shared Pig routines available in the Piggy Bank.
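A Piggy Bank routine is pulled into a script by registering the jar and defining a short alias for the UDF class. A minimal sketch follows; the jar path is an assumption that depends on your installation, and Reverse is just one of the Piggy Bank string UDFs:

```
-- register the Piggy Bank jar so its UDFs are visible (path is illustrative)
REGISTER /path/to/piggybank.jar;
-- give the fully qualified class a short alias
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

reviews = load 'reviews.csv' using PigStorage(',')
    as (reviewer:chararray, title:chararray, rating:int);
-- apply the UDF to one column
backwards = foreach reviews generate Reverse(title);
```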
Tutorial Links
There's a fairly complete guide that walks you through the process of installing Pig and writing
your first couple of scripts. “Working with Pig” is a great overview of the Pig technology.
Example Code
The movie review problem can be expressed quickly in Pig with only five lines of code:
-- Read in all the movie reviews and find the average rating for the film Dune
-- the file reviews.csv has lines of the form:
-- name, film_title, rating
reviews = load 'reviews.csv' using PigStorage(',')
    as (reviewer:chararray, title:chararray, rating:int);
-- Only consider reviews of Dune
duneonly = filter reviews by title == 'Dune';
-- we want to use the Pig built-in AVG function, but
-- AVG works on bags, not flat relations; grouping creates the bags
dunebag = group duneonly by title;
-- now generate the average and then dump it
dunescore = foreach dunebag generate AVG(duneonly.rating);
dump dunescore;
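dump prints the result to the console, which is handy for interactive work; in a pipeline you would more often persist it with store. A sketch, where the output directory name is illustrative:

```
-- write the average out as a comma-delimited file instead of dumping it
store dunescore into 'dune_avg' using PigStorage(',');
```

If the statements above are saved as, say, dune.pig alongside reviews.csv, the whole script can be run against your local filesystem with pig -x local dune.pig.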