Hadoop itself is written in Java. The standard Hadoop Java API exposes classes and interfaces for defining every detail of a MapReduce job. This is fine for some applications, but, as we have seen already, developing an application by managing MapReduce steps directly can become unwieldy. In other words, the raw Hadoop Java API doesn't match the level of abstraction provided by a tool like Pig.
Although many data API frameworks have been built to ease the development of Hadoop MapReduce applications, one of the most powerful and popular open-source frameworks is Cascading. Originally developed by Chris Wensel, Cascading provides a well-thought-out programming interface and is a great introduction to thinking about data in terms of streams. Cascading is a great addition to Hadoop's standard MapReduce interface, and it has resulted in a large ecosystem of tools built on top of it. In fact, I can't imagine many situations in which any type of Java-based Hadoop workflow—whether simple or complex—should be built without Cascading.
Thinking in Terms of Sources and Sinks
Thanks to humanity's great need for water, a metaphor we are all familiar with is flow. Water can flow from a reservoir source, and the flow can be split into multiple destinations. One of these streams may end up in a bathtub, whereas another might be sent to be converted into steam to make coffee. The underlying details of how individual water molecules flow are not really a concern to many of us. As long as individual pipes are able to connect, most of us don't think much about how the process of “flowing” really works. Water goes in one end of a pipe and comes out the other.
If you've ever used Unix command-line tools, you might already be familiar with the “pipe” paradigm. In a Unix pipe, the output of one process becomes the input of another. These operations may be chained together one after another. Even better, as long as each component in the pipeline completes its specific task as expected, the user doesn't have to worry much about the individual commands.
In Chapter 8, we introduced the Hadoop streaming API, which extends the concept of software pipelines as an abstraction for defining distributed MapReduce algorithms. The output data of the mapper functions becomes the input data of the reducer functions.
Cascading provides yet another layer of abstraction on top of MapReduce, helping us think of data more like streams of water and less like individual MapReduce steps. Whereas Hadoop provides a layer of abstraction for easily managing distributed applications over a cluster of machines, Cascading provides an abstract model for processing data on the Hadoop framework.
In the Cascading model, data inputs (or sources) and outputs (known as sinks) are fed into the application through taps. These taps are connected together by pipes, which can be combined, split, and even run through filters. Finally, any number of flows can be assembled together. When flows are linked together, the result is a cascade.
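To make these terms concrete, the sketch below wires a single pipe between a source tap and a sink tap and runs it as a flow. It is a minimal illustration, assuming the Cascading 2.x Hadoop planner; the class name CopyFlow and the input and output paths are hypothetical placeholders rather than anything from the text.

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class CopyFlow {
      public static void main(String[] args) {
        // Source tap: where records flow in (a text file on HDFS; path is a placeholder)
        Tap source = new Hfs(new TextLine(), "input/lines.txt");

        // Sink tap: where records flow out
        Tap sink = new Hfs(new TextLine(), "output/lines");

        // A pipe carries records from source to sink; here it simply
        // passes each line through unchanged
        Pipe copy = new Pipe("copy");

        // A flow binds the pipe assembly to its taps
        FlowDef flowDef = FlowDef.flowDef()
            .setName("copy-flow")
            .addSource(copy, source)
            .addTailSink(copy, sink);

        // The connector plans the flow into MapReduce jobs and runs it
        Flow flow = new HadoopFlowConnector().connect(flowDef);
        flow.complete();
      }
    }

Connecting several such flows with Cascading's CascadeConnector would, in the same spirit, produce a cascade, but a single flow is enough to show how taps and pipes fit together.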