$ git clone git://github.com/Cascading/Impatient.git
Once that completes, change into the part1 subdirectory. You're ready to begin
programming in Cascading.
Example 1: Simplest Possible App in Cascading
The first item on our agenda is how to write a simple Cascading app. The goal is clear
and concise: create the simplest possible app in Cascading while following best practices.
This app will copy a file, potentially a very large file, in parallel—in other words, it
performs a distributed copy. No bangs, no whistles, just good solid code.
First, we create a source tap to specify the input data. That data happens to be formatted
as tab-separated values (TSV) with a header row, which the TextDelimited data
scheme handles.
String inPath = args[0];
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
Next we create a sink tap to specify the output data, which will also be in TSV format:
String outPath = args[1];
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);
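To make the "TSV with a header row" idea concrete, here is a minimal plain-Java sketch of what such a format implies: the first line names the fields, and each later line is one tuple of values. This is a toy illustration only, not how TextDelimited is implemented; the class name TsvHeaderSketch and the sample data are made up for this example, and the input is assumed to contain at least the header line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: this is NOT Cascading's TextDelimited implementation.
// The class name and sample data are hypothetical.
final class TsvHeaderSketch {

    // The field names come from the first (header) line.
    static List<String> fieldNames(List<String> lines) {
        return Arrays.asList(lines.get(0).split("\t", -1));
    }

    // Every line after the header becomes one tuple of string values.
    static List<List<String>> tuples(List<String> lines) {
        List<List<String>> result = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            // The -1 limit keeps trailing empty columns instead of dropping them.
            result.add(Arrays.asList(line.split("\t", -1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input: one header line, then two data lines.
        List<String> lines = Arrays.asList("id\tname", "1\talpha", "2\tbeta");
        System.out.println("fields: " + fieldNames(lines));
        System.out.println("tuples: " + tuples(lines));
    }
}
```

Because both taps in our app use the same scheme, the copy carries the header and every data line through unchanged.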
Then we create a pipe to connect the taps:
Pipe copyPipe = new Pipe("copy");
Here comes the fun part. Get your tool belt ready, because we need to do a little
plumbing. Connect the taps and the pipe to create a flow:
FlowDef flowDef = FlowDef.flowDef()
  .addSource(copyPipe, inTap)
  .addTailSink(copyPipe, outTap);
The notion of a workflow lives at the heart of Cascading. Instead of thinking in terms
of map and reduce phases in a Hadoop job step, Cascading developers define workflows
and business processes as if they were doing plumbing work.
Enterprise data workflows tend to use lots of job steps. Those job steps are connected
and have dependencies, specified as a directed acyclic graph (DAG). Cascading uses
FlowDef objects to define how a flow—that is to say, a portion of the DAG—must be
connected. A pipe must connect to both a source and a sink. Done and done. That
defines the simplest flow possible.
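The rule that a pipe must connect to both a source and a sink can be sketched in plain Java. What follows is a toy model under stated assumptions, not Cascading's FlowDef or its planner: the class ToyFlowDef, its methods, and the file names in.tsv and out.tsv are all hypothetical, chosen only to show the kind of connectivity check involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, NOT Cascading's FlowDef or planner.
final class ToyFlowDef {
    private final List<String> pipes = new ArrayList<>();
    private final Map<String, String> sources = new HashMap<>(); // pipe name -> input path
    private final Map<String, String> sinks = new HashMap<>();   // pipe name -> output path

    ToyFlowDef addPipe(String pipe) { pipes.add(pipe); return this; }
    ToyFlowDef addSource(String pipe, String path) { sources.put(pipe, path); return this; }
    ToyFlowDef addSink(String pipe, String path) { sinks.put(pipe, path); return this; }

    // Returns one message per missing connection; an empty list means
    // every pipe has both a source and a sink.
    List<String> check() {
        List<String> problems = new ArrayList<>();
        for (String pipe : pipes) {
            if (!sources.containsKey(pipe)) problems.add("pipe '" + pipe + "' has no source");
            if (!sinks.containsKey(pipe)) problems.add("pipe '" + pipe + "' has no sink");
        }
        return problems;
    }

    public static void main(String[] args) {
        // A fully connected flow, mirroring the copy example.
        ToyFlowDef complete = new ToyFlowDef()
            .addPipe("copy")
            .addSource("copy", "in.tsv")
            .addSink("copy", "out.tsv");
        System.out.println("complete flow problems: " + complete.check());

        // A flow whose pipe was never given a sink.
        ToyFlowDef dangling = new ToyFlowDef()
            .addPipe("copy")
            .addSource("copy", "in.tsv");
        System.out.println("dangling flow problems: " + dangling.check());
    }
}
```

The builder-style chaining loosely echoes the shape of the FlowDef call above; a real flow would of course involve much more than this two-map lookup.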
Now that we have a flow defined, one last line of code invokes the planner on it. Planning
a flow is akin to generating the physical plan for a query in SQL. The planner verifies that the correct
fields are available for each operation, that the sequence of operations makes sense, and
that all of the pipes and taps are connected in some meaningful way. If the planner