Getting Started - Enterprise Data Workflows with Cascading


$ git clone git://github.com/Cascading/Impatient.git

Once that completes, change into the part1 subdirectory. You're ready to begin

programming in Cascading.

Example 1: Simplest Possible App in Cascading

The first item on our agenda is how to write a simple Cascading app. The goal is clear

and concise: create the simplest possible app in Cascading while following best practices.

This app will copy a file, potentially a very large file, in parallel—in other words, it

performs a distributed copy. No bangs, no whistles, just good solid code.

First, we create a source tap to specify the input data. That data happens to be formatted

as tab-separated values (TSV) with a header row, which the TextDelimited data

scheme handles.

String inPath = args[0];

Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

Next we create a sink tap to specify the output data, which will also be in TSV format:

String outPath = args[1];

Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);
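To make the "TSV with a header row" format concrete, here is a minimal plain-Java sketch of the idea: the first line supplies the field names, and every later line is a record with tab-separated values. It does not use Cascading and is not how TextDelimited is implemented; the class name TsvHeaderSketch, its helper methods, and the sample data are all hypothetical, chosen only for illustration. It assumes the input has at least a header line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Cascading.
class TsvHeaderSketch {
    // The first line of the input is treated as the header row.
    static List<String> fieldNames(List<String> lines) {
        return Arrays.asList(lines.get(0).split("\t", -1));
    }

    // Turn every line after the header into a map of field name -> value.
    static List<Map<String, String>> readRows(List<String> lines) {
        List<String> names = fieldNames(lines);
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < names.size(); i++) {
                // Short lines are padded with empty strings.
                row.put(names.get(i), i < values.length ? values[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    // Write the header row first, then one tab-separated line per record.
    static List<String> writeRows(List<String> names, List<Map<String, String>> rows) {
        List<String> out = new ArrayList<>();
        out.add(String.join("\t", names));
        for (Map<String, String> row : rows) {
            List<String> values = new ArrayList<>();
            for (String name : names) {
                values.add(row.get(name));
            }
            out.add(String.join("\t", values));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data: a header line plus two records.
        List<String> input = Arrays.asList("id\tname", "1\talpha", "2\tbeta");
        List<String> copy = writeRows(fieldNames(input), readRows(input));
        System.out.println(copy.equals(input)); // prints "true"
    }
}
```

Reading with a header and writing with a header is what makes the copy faithful: the field names travel with the data from source to sink.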

Then we create a pipe to connect the taps:

Pipe copyPipe = new Pipe("copy");

Here comes the fun part. Get your tool belt ready, because we need to do a little

plumbing. Connect the taps and the pipe to create a flow:

FlowDef flowDef = FlowDef.flowDef()

  .addSource(copyPipe, inTap)

  .addTailSink(copyPipe, outTap);

The notion of a workflow lives at the heart of Cascading. Instead of thinking in terms

of map and reduce phases in a Hadoop job step, Cascading developers define workflows

and business processes as if they were doing plumbing work.

Enterprise data workflows tend to use lots of job steps. Those job steps are connected

and have dependencies, specified as a directed acyclic graph (DAG). Cascading uses

FlowDef objects to define how a flow—that is to say, a portion of the DAG—must be

connected. A pipe must connect to both a source and a sink. Done and done. That

defines the simplest flow possible.
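The rule that a pipe must connect to both a source and a sink can be sketched as a small check in plain Java. This is a toy model under stated assumptions, not Cascading's FlowDef or its planner: the class FlowCheckSketch, its methods, and the file names are hypothetical, and taps are reduced to path strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a flow definition; not the Cascading API.
class FlowCheckSketch {
    private final List<String> pipes = new ArrayList<>();
    private final Map<String, String> sources = new HashMap<>(); // pipe name -> input path
    private final Map<String, String> sinks = new HashMap<>();   // pipe name -> output path

    FlowCheckSketch addPipe(String name) {
        pipes.add(name);
        return this;
    }

    FlowCheckSketch addSource(String pipe, String path) {
        sources.put(pipe, path);
        return this;
    }

    FlowCheckSketch addSink(String pipe, String path) {
        sinks.put(pipe, path);
        return this;
    }

    // Report every pipe that is missing a source or a sink.
    List<String> problems() {
        List<String> found = new ArrayList<>();
        for (String pipe : pipes) {
            if (!sources.containsKey(pipe)) {
                found.add("pipe '" + pipe + "' has no source");
            }
            if (!sinks.containsKey(pipe)) {
                found.add("pipe '" + pipe + "' has no sink");
            }
        }
        return found;
    }

    public static void main(String[] args) {
        FlowCheckSketch complete = new FlowCheckSketch()
            .addPipe("copy")
            .addSource("copy", "in.tsv")
            .addSink("copy", "out.tsv");
        System.out.println(complete.problems().isEmpty()); // prints "true"

        FlowCheckSketch dangling = new FlowCheckSketch()
            .addPipe("copy")
            .addSource("copy", "in.tsv");
        System.out.println(dangling.problems());
    }
}
```

The fluent, chained style mirrors the way the flow definition in the text is built up one connection at a time, which makes a missing connection easy to spot before any data moves.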

Now that we have a flow defined, one last line of code invokes the planner on it. Planning

a flow is akin to building the physical plan for a query in SQL. The planner verifies that the correct

fields are available for each operation, that the sequence of operations makes sense, and

that all of the pipes and taps are connected in some meaningful way. If the planner
