  // split "line" on tabs
  pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));

  pipe = new LogParser(pipe);

  pipe = new LogRules(pipe);

  // testing only assertions
  pipe = new ParserAssertions(pipe);

  Flow flow = new FlowConnector().connect(source, sink, pipe);

  flow.complete(); // run the test flow

  // Verify there are 98 tuples and 2 fields, and matches the regex pattern
  // For TextLine schemes the tuples are { "offset", "line" }
  validateLength(flow, 98, 2, Pattern.compile("^[0-9]+(\\t[^\\t]*){19}$"));
}
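To make the regex in the final assertion easier to read, here is a minimal standalone sketch in plain Java (no Cascading classes). It assumes the pattern is checked against each tuple rendered as tab-separated text: a numeric offset followed by exactly 19 tab-prefixed fields. The class name TupleLinePattern and the helper sampleLine are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

public class TupleLinePattern {
    // Same pattern as in the test: a numeric offset, then exactly 19 tab-prefixed fields.
    static final Pattern LAYOUT = Pattern.compile("^[0-9]+(\\t[^\\t]*){19}$");

    /** Builds a made-up line: offset "0" followed by fieldCount tab-prefixed values. */
    public static String sampleLine(int fieldCount) {
        StringBuilder line = new StringBuilder("0");
        for (int i = 0; i < fieldCount; i++) {
            line.append('\t').append("f").append(i);
        }
        return line.toString();
    }

    /** Returns true only when the whole line has the expected layout. */
    public static boolean matchesLayout(String line) {
        return LAYOUT.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesLayout(sampleLine(19))); // true
        System.out.println(matchesLayout(sampleLine(18))); // false: one field short
        System.out.println(matchesLayout(sampleLine(20))); // false: one field too many
    }
}
```

Because each field is matched with [^\t]*, empty fields still count, so the check is about the number of columns rather than their contents.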
For integration and deployment, many of the features built into Cascading allowed for
easier integration with external systems and for greater process tolerance.
In production, all the subassemblies are joined and planned into a Flow, but instead of
just source and sink Taps, trap Taps are planned in as well. Normally, when an
operation throws an exception from a remote mapper or reducer task, the Flow will fail
and kill all its managed MapReduce jobs. When a Flow has traps, any exceptions are
caught and the data causing the exception is saved to the Tap associated with the current
trap. Then the next Tuple is processed without stopping the Flow. Sometimes you want
your Flows to fail on errors, but in this case, the ShareThis developers knew they could
go back and look at the “failed” data and update their unit tests while the production
system kept running. Losing a few hours of processing time was worse than losing a couple
of bad records.
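The trap behavior described above can be sketched in plain Java. This is not Cascading's API: the class TrapSketch and its run method are hypothetical, and in-memory lists stand in for the source, sink, and trap Taps. The sketch only illustrates the semantics: with a trap, a failing record is set aside and processing continues; without one, the first exception aborts the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TrapSketch {
    /**
     * Applies op to every record in source and adds the results to sink.
     * If trap is non-null, records whose operation throws are saved to it
     * and processing moves on; if trap is null, the exception propagates.
     */
    public static <A, B> void run(List<A> source, Function<A, B> op,
                                  List<B> sink, List<A> trap) {
        for (A record : source) {
            try {
                sink.add(op.apply(record));
            } catch (RuntimeException e) {
                if (trap == null) {
                    throw e; // no trap: the whole run fails
                }
                trap.add(record); // keep the bad input for later inspection
            }
        }
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("10", "oops", "30");
        List<Integer> sink = new ArrayList<>();
        List<String> trap = new ArrayList<>();

        run(source, Integer::parseInt, sink, trap);

        System.out.println("sink=" + sink + " trap=" + trap);
        // prints: sink=[10, 30] trap=[oops]
    }
}
```

Saving the offending input rather than only the error message is what makes the trade-off in the text workable: the trapped records can be replayed later as new unit test cases while the main run keeps going.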