Hadoop itself is written in Java. The standard Hadoop Java API exposes classes and interfaces for defining every detail of a MapReduce job. This is fine for some applications, but, as we have seen already, developing an application by managing MapReduce steps directly can become unwieldy. In other words, the raw Hadoop Java API doesn't match the level of abstraction provided by a tool like Pig.
Although many data API frameworks have been built to ease the development of Hadoop MapReduce applications, one of the most powerful and popular open-source frameworks is Cascading. Originally developed by Chris Wensel, Cascading provides a well-thought-out programming interface and is a great introduction to thinking about data in terms of streams. Cascading is a great addition to Hadoop's standard MapReduce interface, and it has resulted in a large ecosystem of tools built on top of it. In fact, I can't imagine many situations in which any type of Java-based Hadoop workflow—whether simple or complex—should be built without Cascading.
Thinking in Terms of Sources and Sinks
Thanks to humanity's great need for water, a metaphor we are all familiar with is flow. Water can flow from a reservoir source, and the flow can be split into multiple destinations. One of these streams may end up in a bathtub, whereas another might be sent to be converted into steam to make coffee. The underlying details of how individual water molecules flow are not really a concern to many of us. As long as individual pipes are able to connect, most of us don't think much about how the process of “flowing” really works. Water goes in one end of a pipe and comes out the other.
If you've ever used Unix command-line tools, you might already be familiar with the “pipe” paradigm. In a Unix pipe, the output of one process becomes the input of another. These operations may be chained together one after another. Even better, as long as each component in the pipeline completes its specific task as expected, the user doesn't have to worry much about the individual commands.
In Chapter 8, we introduced the Hadoop streaming API, which extends the concept of software pipelines as an abstraction for defining distributed MapReduce algorithms. The output data of the mapper functions becomes the input data of the reducer functions.
Cascading provides yet another layer of abstraction on top of MapReduce, helping us think of data more like streams of water and less like individual MapReduce steps. Whereas Hadoop provides a layer of abstraction for easily managing distributed applications over a cluster of machines, Cascading provides an abstract model for processing data on the Hadoop framework.
In the Cascading model, data inputs (or sources) and outputs (known as sinks) are fed into the application through taps. These taps are connected together by pipes, which can be combined, split, and even run through filters. Finally, any number of flows can be assembled together. When flows are linked together, the result is a cascade.
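To make these terms concrete, the sketch below wires a single pipe between a source tap and a sink tap and runs it as a flow. It is a minimal illustration, assuming the Cascading 2.x Hadoop planner; the class name CopyFlow and the input and output paths are hypothetical placeholders rather than anything from the text.

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class CopyFlow {
      public static void main(String[] args) {
        // Source tap: where records flow in (a text file on HDFS; path is a placeholder)
        Tap source = new Hfs(new TextLine(), "input/lines.txt");

        // Sink tap: where records flow out
        Tap sink = new Hfs(new TextLine(), "output/lines");

        // A pipe carries records from source to sink; here it simply
        // passes each line through unchanged
        Pipe copy = new Pipe("copy");

        // A flow binds the pipe assembly to its taps
        FlowDef flowDef = FlowDef.flowDef()
            .setName("copy-flow")
            .addSource(copy, source)
            .addTailSink(copy, sink);

        // The connector plans the flow into MapReduce jobs and runs it
        Flow flow = new HadoopFlowConnector().connect(flowDef);
        flow.complete();
      }
    }

Connecting several such flows with Cascading's CascadeConnector would, in the same spirit, produce a cascade, but a single flow is enough to show how taps and pipes fit together.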