Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter
  = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// returns only "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
Out of that pipe, we get a tuple stream of token values. One benefit of using a regex is
that it's simple to change. We can handle more complex cases of splitting tokens without
having to rewrite the generator.
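To see what that regex treats as a delimiter, here is a minimal plain-Java sketch that applies the same character class with String.split, outside of Cascading. The class name TokenSplitSketch and the tokenize helper are hypothetical, and dropping empty strings is a choice made for this sketch, not a statement about how RegexSplitGenerator behaves.

```java
import java.util.ArrayList;
import java.util.List;

class TokenSplitSketch {
    // Same character class as the splitter above: space, [ ] ( ) , and .
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split one line of text into tokens. Adjacent delimiters produce empty
    // strings, which this sketch drops (an assumption made for illustration).
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.split(DELIMITERS)) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("rain shadow (dry), area."));
    }
}
```

Changing the tokenization rule means editing only the DELIMITERS string, which mirrors the point about the regex being simple to change.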
Next, we use a GroupBy to count the occurrences of each token:
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
Notice that we've used Each and Every to perform operations within the pipe assembly.
The difference between the two is that an Each operates on individual tuples, so
it takes Function or Filter operations. An Every operates on groups of tuples, so it takes
Aggregator or Buffer operations. In this case the GroupBy collected the tuples into
groups by token, and Count is the Aggregator applied to each group.
The different ways of inserting operations serve to categorize the different built-in
operations in Cascading. They also illustrate how the pattern language syntax guides the
development of workflows.
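The distinction between per-tuple and per-group operations can be sketched in plain Java collections, without Cascading. This is only an analogy: the class EachEverySketch and its each, groupBy, and every helpers are hypothetical names that mimic the roles described above, and they do not use the Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class EachEverySketch {
    // Each-style step: the function sees one record at a time
    // and may emit several results for it.
    static <A, B> List<B> each(List<A> tuples, Function<A, List<B>> fn) {
        List<B> out = new ArrayList<>();
        for (A t : tuples) {
            out.addAll(fn.apply(t));
        }
        return out;
    }

    // GroupBy-style step: collect equal tokens under one grouping key.
    static Map<String, List<String>> groupBy(List<String> tokens) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String t : tokens) {
            groups.computeIfAbsent(t, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    // Every-style step: the aggregator sees a whole group
    // and emits one result per group.
    static <R> Map<String, R> every(Map<String, List<String>> groups,
                                    Function<List<String>, R> aggregator) {
        Map<String, R> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            out.put(e.getKey(), aggregator.apply(e.getValue()));
        }
        return out;
    }

    // Word count: split each line, group by token, count each group.
    // Lines are assumed to be separated by single spaces in this sketch.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<String> tokens = each(lines, line -> Arrays.asList(line.split(" ")));
        return every(groupBy(tokens), List::size);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b a")));
    }
}
```

The split function never needs to see more than one line, while the count only makes sense once all tuples for a token have been brought together, which is why the two kinds of operation attach to the assembly in different ways.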
From that wcPipe we get a resulting tuple stream of token and count for the output.
Again, we connect the plumbing with a FlowDef :
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
Finally, we generate a DOT file to depict the Cascading flow graphically. You can load
the DOT file into OmniGraffle or Visio. Those diagrams are really helpful for
troubleshooting workflows in Cascading:
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
This code is already in the part2/src/main/java/impatient/ directory, in the Main.java
file. To build it:
$ gradle clean jar
Then to run it:
$ rm -rf output
$ hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc