JobTrackers and execute various steps in the MapReduce framework. Many instances
of these services run simultaneously on a collection of physical or virtual nodes in the
cluster.
For your application to execute in parallel, it must be accessible to every
relevant node in the Hadoop cluster. One way to deploy your code is to copy
your application, along with any dependencies it needs to run, to every node by
hand. As you can imagine, this approach is error prone, time consuming, and,
worst of all, plain annoying.
When the hadoop jar command is invoked, your JAR file (along with other
necessary dependencies specified via the -libjars flag) is copied automatically to all
relevant nodes in the cluster. The lesson here is that tools like Hadoop, Pig, and Cas-
cading are all different layers of abstraction that help us think about distributed systems
in procedural ways.
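As a concrete sketch, a job launch might look like the following (the JAR, class, and path names here are hypothetical):

```
hadoop jar myapp.jar com.example.MyJob \
    -libjars dependency-one.jar,dependency-two.jar \
    /input/path /output/path
```

Note that -libjars is one of Hadoop's generic options: it is honored only when the driver class is run through ToolRunner, and it must appear before the application's own arguments.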
When to Choose Pig versus Cascading
Like many open-source technologies used in the large-scale data-analytics world, it's
not always clear when to choose Pig over Cascading or over another solution such as
writing Hadoop streaming API scripts. Tools evolve independently from one another,
so the use cases best served by Pig versus Cascading can sometimes overlap, making
decisions about solutions difficult. I generally think of Pig as a workflow tool, whereas
Cascading is better suited as a foundation for building your own workflow applica-
tions. Pig is often the fastest way to run a transformation job.
Analysts who have never written a line of Python or Java should have little trouble
learning how to write their own Pig scripts. A one-time complex transformation job
should certainly use Pig whenever possible; the small amount of code necessary to
complete the task is hard to beat.
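To give a feel for how little code such a job takes, here is a small Pig Latin sketch; the input path, field names, and output path are all assumptions for illustration:

```pig
-- Load raw log lines, keep only errors, and count them by day.
logs   = LOAD 'logs/events.tsv' USING PigStorage('\t')
         AS (day:chararray, level:chararray, msg:chararray);
errors = FILTER logs BY level == 'ERROR';
by_day = GROUP errors BY day;
counts = FOREACH by_day GENERATE group AS day, COUNT(errors) AS n;
STORE counts INTO 'output/error_counts';
```

Seven lines of script replace what would otherwise be a full MapReduce program in Java.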
One of Cascading's biggest strengths is that it provides an abstraction model that
allows for a great deal of modularity. Another advantage of using Cascading is that, as
a Java Virtual Machine (JVM)-based API, it can use all of the rich tools and frame-
works in the Java ecosystem.
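To illustrate that modularity, here is a minimal sketch of a Cascading flow in Java. The class names follow the Cascading 2.x API, and the input/output paths and filter pattern are assumptions, not code from this book:

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class ErrorFilterFlow {
  public static void main(String[] args) {
    // Source and sink taps: where data enters and leaves the flow.
    Tap source = new Hfs(new TextLine(), "logs/events.txt");
    Tap sink   = new Hfs(new TextLine(), "output/errors");

    // A pipe assembly: keep only lines containing "ERROR".
    Pipe pipe = new Pipe("errors");
    pipe = new Each(pipe, new RegexFilter(".*ERROR.*"));

    // Connect taps and pipes into a flow that runs on Hadoop.
    Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
    flow.complete();
  }
}
```

Because taps, pipes, and flows are ordinary Java objects, each piece can be built, tested, and reused independently, which is the modularity the text describes.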
Summary
Pig and Cascading are two very different open-source tools for building complex
data workflows that run on Hadoop. Pig is a data processing platform that provides
an easy-to-use syntax for defining procedural workflow steps. Cascading is a well-
designed and popular data processing API for building robust workflow applications.
Cascading simplifies data processing using a metaphor that equates data to water:
sources, sinks, taps, and pipes. Data streams can cascade. Thus Cascading is useful for
building, testing, and deploying robust data applications. Because Cascading uses the
Java Virtual Machine, it has also become the basis for data-application APIs in other
languages that run on the JVM.