some tasks, but these scripts can become brittle if input or output parameters change.
What if the requirements of your key-value mappings need to be updated? How many
parts of your code must change? If your code depends on particular machine char-
acteristics, such as the presence of a particular library or module, will the workflow
break when ported from one cluster to another?
When we build data applications in this way, we are expressing our goals in a way
that is a bit more computer than human. One of the core lessons of this book is to
worry primarily about the data problem rather than the technology. The MapReduce
paradigm is an abstraction that works well for machines. Computers are great at pass-
ing around and keeping track of thousands of individual chunks of data. However, on
behalf of my fellow humans, thinking in terms of the individual steps of a MapReduce
job is not our strongest skill. We likely don't want to worry about each step; we just
want the machines to whir away and coordinate amongst themselves. We would much
rather say “take this collection of data here, get rid of the values I don't care about,
combine it with this other data, and put the result over here.” This is why we need
workflow software: tools that can take higher-level descriptions of data processing
flows and translate them into distributed MapReduce steps that can be run on
frameworks such as Hadoop.
Apache Pig: “Ixnay on the Omplexitycay”
When it comes to defining data workflows on Hadoop, the Apache Pig framework is
often the tool of choice. The Apache Pig Web site claims that 40% of all Hadoop jobs
run at Yahoo! are Pig jobs. Pig provides two components: a high-level and simple-to-
learn language for expressing workflows and a platform to turn these workflows into
MapReduce jobs. Pig was originally developed at Yahoo! and was donated to the
Apache Software Foundation in 2007.
Pig's syntax is known as Pig Latin. Unlike the Pig Latin sketches from classic
Three Stooges shorts, Apache's Pig Latin is not very useful for obfuscating messages.
In fact, Pig's syntax is incredibly clear. Common verbs such as LOAD, FILTER, JOIN,
and GROUP are used to define steps in a data workflow.
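For example, a minimal Pig Latin sketch using these verbs might look like the
following (the file names, field names, and schemas here are hypothetical):

-- Load raw log records from a hypothetical tab-delimited file
logs = LOAD 'access_log.txt' AS (user_id:chararray, url:chararray, bytes:int);

-- Get rid of the records we don't care about
big_requests = FILTER logs BY bytes > 1024;

-- Combine with a second hypothetical dataset keyed on user_id
users = LOAD 'users.txt' AS (user_id:chararray, country:chararray);
joined = JOIN big_requests BY user_id, users BY user_id;

-- Group the joined records by country and count them
by_country = GROUP joined BY users::country;
counts = FOREACH by_country GENERATE group, COUNT(joined);

-- Put the result over here
STORE counts INTO 'output/requests_by_country';

Pig translates these few statements into one or more MapReduce jobs behind the
scenes; the workflow author never writes a mapper or a reducer.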
In some ways, Pig's workflow syntax is analogous to the use of SQL for managing
and querying relational databases, and there is some overlap between the types of
results that can be produced by SQL queries and Pig statements. However, comparing
Pig Latin's syntax to that of SQL is not really a fair comparison; the two tools
occupy very different problem domains. SQL allows query writers to declare gener-
ally what type of operation should take place (such as a SELECT or a JOIN) but not the
implementation details of these actions. Pig allows workflow writers to choose a
particular implementation of each workflow step. SQL generally aims to provide a single query
result from a query, perhaps joining the results of queries from several tables into a single
result set. Pig provides a means to split data streams into multiple parts, filter, and save
the results in multiple locations, making it excellent for extracting and transforming data.
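As a sketch of that last point, Pig's SPLIT and STORE operators can route a single
input stream to several outputs within one workflow (again, the file names and
fields here are hypothetical):

-- Load a stream of event records
events = LOAD 'events.txt' AS (event:chararray, severity:int);

-- Route each record into one of two streams based on a condition
SPLIT events INTO errors IF severity >= 3, normal IF severity < 3;

-- Save each stream to its own location
STORE errors INTO 'output/errors';
STORE normal INTO 'output/normal';

A single SQL query, by contrast, produces one result set; fanning data out to
several destinations like this is exactly the extract-and-transform territory
where Pig is at home.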
Distributed frameworks such as Hadoop essentially take large data problems, split
these problems into smaller tasks, and attempt to solve many of those tasks at the same
time.
 