JobTrackers and execute various steps in the MapReduce framework. Many instances
of these services run simultaneously on a collection of physical or virtual nodes in the
cluster.
For your application to execute in parallel, it must be accessible to every
relevant node in the Hadoop cluster. One way to deploy your code is to copy
your application, along with any dependencies it needs to run, to every node by
hand. As you can imagine, this approach is error prone, time consuming, and,
worst of all, plain annoying.
When the hadoop jar command is invoked, your JAR file (along with other
necessary dependencies specified via the -libjars flag) is copied automatically to all
relevant nodes in the cluster. The lesson here is that tools like Hadoop, Pig, and Cas-
cading are all different layers of abstraction that help us think about distributed systems
in procedural ways.
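As a concrete sketch, a job launch might look like the following (the JAR, class, and path names here are hypothetical):

```
hadoop jar myapp.jar com.example.MyJob \
    -libjars dependency-one.jar,dependency-two.jar \
    /input/path /output/path
```

Note that -libjars is one of Hadoop's generic options: it is honored only when the driver class is run through ToolRunner, and it must appear before the application's own arguments.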
When to Choose Pig versus Cascading
Like many open-source technologies used in the large-scale data-analytics world, it's
not always clear when to choose Pig over Cascading or over another solution such as
writing Hadoop streaming API scripts. Tools evolve independently from one another,
so the use cases best served by Pig versus Cascading can sometimes overlap, making
decisions about solutions difficult. I generally think of Pig as a workflow tool, whereas
Cascading is better suited as a foundation for building your own workflow applica-
tions. Pig is often the fastest way to run a transformation job.
Analysts who have never written a line of Python or Java should have little trouble
learning how to write their own Pig scripts. A one-time complex transformation job
should certainly use Pig whenever possible; the small amount of code necessary to
complete the task is hard to beat.
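To give a feel for how little code such a job takes, here is a small Pig Latin sketch; the input path, field names, and output path are all assumptions for illustration:

```pig
-- Load raw log lines, keep only errors, and count them by day.
logs   = LOAD 'logs/events.tsv' USING PigStorage('\t')
         AS (day:chararray, level:chararray, msg:chararray);
errors = FILTER logs BY level == 'ERROR';
by_day = GROUP errors BY day;
counts = FOREACH by_day GENERATE group AS day, COUNT(errors) AS n;
STORE counts INTO 'output/error_counts';
```

Seven lines of script replace what would otherwise be a full MapReduce program in Java.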
One of Cascading's biggest strengths is that it provides an abstraction model that
allows for a great deal of modularity. Another advantage of using Cascading is that, as
a Java Virtual Machine (JVM)-based API, it can use all of the rich tools and frame-
works in the Java ecosystem.
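To illustrate that modularity, here is a minimal sketch of a Cascading flow in Java. The class names follow the Cascading 2.x API, and the input/output paths and filter pattern are assumptions, not code from this book:

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class ErrorFilterFlow {
  public static void main(String[] args) {
    // Source and sink taps: where data enters and leaves the flow.
    Tap source = new Hfs(new TextLine(), "logs/events.txt");
    Tap sink   = new Hfs(new TextLine(), "output/errors");

    // A pipe assembly: keep only lines containing "ERROR".
    Pipe pipe = new Pipe("errors");
    pipe = new Each(pipe, new RegexFilter(".*ERROR.*"));

    // Connect taps and pipes into a flow that runs on Hadoop.
    Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
    flow.complete();
  }
}
```

Because taps, pipes, and flows are ordinary Java objects, each piece can be built, tested, and reused independently, which is the modularity the text describes.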
Summary
Pig and Cascading are two very different open-source tools for building complex
data workflows that run on Hadoop. Pig is a data processing platform that provides
an easy-to-use syntax for defining procedural workflow steps. Cascading is a well-
designed and popular data processing API for building robust workflow applications.
Cascading simplifies data processing using a metaphor that equates data to water:
sources, sinks, taps, and pipes. Data streams can cascade. Thus Cascading is useful for
building, testing, and deploying robust data applications. Because Cascading uses the
Java Virtual Machine, it has also become the basis for data-application APIs in other
languages that run on the JVM.