Getting Started - Enterprise Data Workflows with Cascading

Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter =
    new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// returns only "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

Out of that pipe, we get a tuple stream of token values. One benefit of using a regex is that it's simple to change. We can handle more complex cases of splitting tokens without having to rewrite the generator.
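
To see what such a delimiter regex does to a line of text, here is a minimal plain-Java sketch that does not use Cascading at all. It assumes the delimiter class is meant to cover spaces, square brackets, parentheses, commas, and periods. The class name TokenSplitSketch, the tokenize helper, and the sample sentence are hypothetical, and dropping empty strings is a simplification of this sketch rather than a statement about how RegexSplitGenerator behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TokenSplitSketch {
    // Assumed delimiter class: space, [ ] ( ) , and .
    static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split one line of text into tokens, skipping the empty strings
    // that appear between adjacent delimiters (a choice made for this sketch).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITERS.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Rain, shadow, dry, area, lee, side]
        System.out.println(tokenize("Rain shadow (dry area), lee side."));
    }
}
```

Changing the tokenization rule means editing only the pattern string, which is the point made above about the regex being simple to change.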

Next, we use a GroupBy to count the occurrences of each token:

Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

Notice that we've used Each and Every to perform operations within the pipe assembly. The difference between the two is that an Each operates on individual tuples, so it takes Function operations. An Every operates on groups of tuples, so it takes Aggregator or Buffer operations. In this case, the GroupBy forms the groups and Count is the Aggregator applied to each group. The different ways of inserting operations serve to categorize the built-in operations in Cascading. They also illustrate how the pattern language syntax guides the development of workflows.
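
As a rough analogy only, the per-tuple versus per-group distinction can be sketched with plain Java streams. This is not Cascading code: the class EachEveryAnalogy and its methods are hypothetical names, and the splitting here is on single spaces to keep the sketch short. The flatMap step plays the role of an Each with a Function (one record in, zero or more tokens out), while groupingBy with counting plays the role of a GroupBy followed by an Every with a counting Aggregator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class EachEveryAnalogy {
    // Each-like step: applied to one record at a time, emits its tokens.
    static List<String> splitLine(String line) {
        return Arrays.stream(line.split(" "))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // GroupBy + Every-like step: group identical tokens, then reduce
    // each group to a single count. TreeMap keeps the keys sorted.
    static Map<String, Long> countByToken(List<String> lines) {
        return lines.stream()
                .flatMap(line -> splitLine(line).stream())
                .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {lee=1, rain=2, shadow=1}
        System.out.println(countByToken(List.of("rain shadow", "rain lee")));
    }
}
```

The per-record function never sees more than one line, whereas the counting step only makes sense once all tuples with the same token have been brought together.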

From that wcPipe we get a resulting tuple stream of token and count for the output. Again, we connect the plumbing with a FlowDef:

FlowDef flowDef = FlowDef.flowDef()
    .setName("wc")
    .addSource(docPipe, docTap)
    .addTailSink(wcPipe, wcTap);

Finally, we generate a DOT file to depict the Cascading flow graphically. You can load the DOT file into OmniGraffle or Visio. Those diagrams are really helpful for troubleshooting workflows in Cascading:

Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();

This code is already in the part2/src/main/java/impatient/ directory, in the Main.java file. To build it:

$ gradle clean jar

Then to run it:

$ rm -rf output

$ hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc
