some tasks, but these scripts can become brittle if input or output parameters change.
What if the requirements of your key-value mappings need to be updated? How many
parts of your code must change? If your code depends on particular machine char-
acteristics, such as the presence of a particular library or module, will the workflow
break when ported from one cluster to another?
When we build data applications in this way, we are expressing our goals in a way
that is a bit more computer than human. One of the core lessons of this book is to
worry primarily about the data problem rather than the technology. The MapReduce
paradigm is an abstraction that works well for machines. Computers are great at pass-
ing around and keeping track of thousands of individual chunks of data. However, on
behalf of my fellow humans, thinking in terms of the individual steps of a MapReduce
job is not our strongest skill. We likely don't want to worry about each step; we just
want the machines to whir away and coordinate amongst themselves. We would much
rather say “take this collection of data here, get rid of the values I don't care about,
combine it with this other data, and put the result over here.” This is why we need
workflow software: tools that can take higher-level descriptions of data processing
flows and translate them into distributed MapReduce steps that can be run on
frameworks such as Hadoop.
Apache Pig: “Ixnay on the Omplexitycay”
When it comes to defining data workflows on Hadoop, the Apache Pig framework is
often the tool of choice. The Apache Pig Web site claims that 40% of all Hadoop jobs
run at Yahoo! are Pig jobs. Pig provides two components: a high-level and simple-to-
learn language for expressing workflows and a platform to turn these workflows into
MapReduce jobs. Pig was originally developed at Yahoo! and was donated to the
Apache Software Foundation in 2007.
Pig's syntax is known as Pig Latin. Unlike the Pig Latin sketches from classic
Three Stooges shorts, Apache's Pig Latin is not very useful for obfuscating messages.
In fact, Pig's syntax is incredibly clear. Common verbs such as LOAD, FILTER, JOIN,
and GROUP are used to define steps in a data workflow.
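For example, a minimal Pig Latin sketch using these verbs might look like the
following (the file names, field names, and schemas here are hypothetical):

-- Load raw log records from a hypothetical tab-delimited file
logs = LOAD 'access_log.txt' AS (user_id:chararray, url:chararray, bytes:int);

-- Get rid of the records we don't care about
big_requests = FILTER logs BY bytes > 1024;

-- Combine with a second hypothetical dataset keyed on user_id
users = LOAD 'users.txt' AS (user_id:chararray, country:chararray);
joined = JOIN big_requests BY user_id, users BY user_id;

-- Group the joined records by country and count them
by_country = GROUP joined BY users::country;
counts = FOREACH by_country GENERATE group, COUNT(joined);

-- Put the result over here
STORE counts INTO 'output/requests_by_country';

Pig translates these few statements into one or more MapReduce jobs behind the
scenes; the workflow author never writes a mapper or a reducer.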
In some ways, Pig's workflow syntax is analogous to the use of SQL for managing
and querying relational databases, and there is some overlap between the types of
results that can be produced by SQL queries and Pig statements. However, comparing
Pig Latin's syntax to that of SQL is not really a fair comparison; the two tools
occupy very different problem domains. SQL allows query writers to declare gener-
ally what type of operation should take place (such as a SELECT or a JOIN) but not the
implementation details of these actions. Pig allows workflow writers to choose a
particular implementation of each workflow step. SQL generally aims to provide a single query
result from a query, perhaps joining the results of queries from several tables into a single
result set. Pig provides a means to split data streams into multiple parts, filter, and save
the results in multiple locations, making it excellent for extracting and transforming data.
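As a sketch of that last point, Pig's SPLIT and STORE operators can route a single
input stream to several outputs within one workflow (again, the file names and
fields here are hypothetical):

-- Load a stream of event records
events = LOAD 'events.txt' AS (event:chararray, severity:int);

-- Route each record into one of two streams based on a condition
SPLIT events INTO errors IF severity >= 3, normal IF severity < 3;

-- Save each stream to its own location
STORE errors INTO 'output/errors';
STORE normal INTO 'output/normal';

A single SQL query, by contrast, produces one result set; fanning data out to
several destinations like this is exactly the extract-and-transform territory
where Pig is at home.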
Distributed frameworks such as Hadoop essentially take large data problems, split
these problems into smaller tasks, and attempt to solve many of those tasks at the same
time.
 