// split "line" on tabs
pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));
pipe = new LogParser(pipe);
pipe = new LogRules(pipe);
// testing only assertions
pipe = new ParserAssertions(pipe);
Flow flow = new FlowConnector().connect(source, sink, pipe);
flow.complete(); // run the test flow
// Verify there are 98 tuples and 2 fields, and matches the regex pattern
// For TextLine schemes the tuples are { "offset", "line" }
validateLength(flow, 98, 2,
    Pattern.compile("^[0-9]+(\\t[^\\t]*){19}$"));
}
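To make the validation pattern in the test concrete, the following standalone sketch checks what the expression accepts: a numeric offset followed by exactly 19 tab-prefixed fields. It uses only java.util.regex and does not call Cascading; the class name LinePatternSketch and the helper sampleLine are hypothetical and exist only for illustration.

```java
import java.util.regex.Pattern;

final class LinePatternSketch {
    // Same expression as in the test: a numeric offset, then exactly 19 tab-prefixed fields.
    static final Pattern EXPECTED = Pattern.compile("^[0-9]+(\\t[^\\t]*){19}$");

    // Builds a made-up line: the offset "0" followed by `fields` tab-separated values.
    static String sampleLine(int fields) {
        StringBuilder sb = new StringBuilder("0");
        for (int i = 0; i < fields; i++) {
            sb.append('\t').append("f").append(i);
        }
        return sb.toString();
    }

    // True when the whole line matches the expected layout.
    static boolean matches(String line) {
        return EXPECTED.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(sampleLine(19))); // prints "true"
        System.out.println(matches(sampleLine(18))); // prints "false"
    }
}
```

Because each field is defined as a run of non-tab characters, a line with too few or too many fields fails the match, which is what lets the test catch a parser that drops or adds a column.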
For integration and deployment, many of the features built into Cascading made it
easier to integrate with external systems and to tolerate failures during processing.
In production, all the subassemblies were joined and planned into a Flow, but instead of
just source and sink Taps, trap Taps were planned in (Figure 24-10). Normally, when an
operation throws an exception from a remote mapper or reducer task, the Flow fails
and kills all its managed MapReduce jobs. When a Flow has traps, any exceptions are
caught and the data causing the exception is saved to the Tap associated with the current
trap. Then the next Tuple is processed without stopping the Flow. Sometimes you want
your Flows to fail on errors, but in this case, the ShareThis developers knew they could
go back and look at the “failed” data and update their unit tests while the production
system kept running. Losing a few hours of processing time was worse than losing a couple
of bad records.
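The trap behavior described above can be illustrated with a small plain-Java sketch. It is not Cascading code and does not use the Cascading API: the class TrapSketch, the method processWithTrap, and the use of in-memory lists in place of sink and trap Taps are all assumptions made for illustration. It shows only the control flow: a record whose operation throws is set aside, and processing moves on to the next record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class TrapSketch {
    // Applies `operation` to each record. A record that causes an exception is
    // added to `trap`, and processing continues with the next record instead of
    // aborting the whole run. Successful results are returned as the "sink".
    static <T, R> List<R> processWithTrap(List<T> records,
                                          Function<T, R> operation,
                                          List<T> trap) {
        List<R> sink = new ArrayList<>();
        for (T record : records) {
            try {
                sink.add(operation.apply(record));
            } catch (RuntimeException e) {
                trap.add(record); // keep the offending input for later inspection
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        List<Integer> sink =
            processWithTrap(Arrays.asList("1", "oops", "3"), Integer::parseInt, trap);
        System.out.println("sink=" + sink + " trap=" + trap); // prints "sink=[1, 3] trap=[oops]"
    }
}
```

Passing no trap at all would correspond to the default behavior, where the first bad record stops the run. Collecting the failed inputs is what allows them to be reviewed later and turned into new unit tests.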