tions (Functions, Filters, etc.) could be written and tested independently. Second, the application was segmented into stages: one for parsing, one for rules, and a final stage for binning/collating the data, all via the SubAssembly base class described earlier.
The data coming from the ShareThis loggers looks a lot like Apache logs, with date/timestamps, share URLs, referrer URLs, and a bit of metadata. To use the data for analysis downstream, the URLs needed to be unpacked (parsing query-string data, domain names, etc.). So, a top-level SubAssembly was created to encapsulate the parsing, and child subassemblies were nested inside to handle specific fields if they were sufficiently complex to parse.
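As a rough illustration of what "unpacking" a URL involves, here is a minimal sketch in plain Java. It does not use the Cascading API and is not the ShareThis parser; the class name UrlUnpacker, its methods, and the sample URL are all hypothetical, chosen only to show the kind of per-field logic a child subassembly might wrap.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Cascading or the ShareThis code base.
final class UrlUnpacker {

    /** Returns the host (domain name) of a URL, or null if the URL has none. */
    public static String domain(String url) {
        return URI.create(url).getHost();
    }

    /** Splits the query string into decoded key/value pairs, keeping their order. */
    public static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return params; // no query string: nothing to unpack
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/article?id=42&src=email%20digest";
        System.out.println(domain(url));
        System.out.println(queryParams(url));
    }
}
```

Keeping each field's logic in a small, separately testable unit mirrors the reason given above for nesting child subassemblies: a complicated field can be developed and verified on its own.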
The same was done for applying rules. As every Tuple passed through the rules SubAssembly, it was marked as “bad” if any of the rules were triggered. Along with the “bad” tag, a description of why the record was bad was added to the Tuple for later review.
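The tagging behavior can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions: records are represented as string maps rather than Cascading Tuples, and the class RuleStage, the field names "status" and "reason", and the two sample rules are invented for illustration. The actual ShareThis rules are not given in the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of a rules stage; it does not use the Cascading API.
final class RuleStage {

    /** A named check. The rule "triggers" when its predicate matches a record. */
    static final class Rule {
        final String description;
        final Predicate<Map<String, String>> triggers;

        Rule(String description, Predicate<Map<String, String>> triggers) {
            this.description = description;
            this.triggers = triggers;
        }
    }

    /** Returns a copy of the record with "status" and "reason" fields added. */
    public static Map<String, String> apply(Map<String, String> record, List<Rule> rules) {
        List<String> reasons = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.triggers.test(record)) {
                reasons.add(rule.description); // remember why the record is bad
            }
        }
        Map<String, String> tagged = new LinkedHashMap<>(record);
        tagged.put("status", reasons.isEmpty() ? "good" : "bad");
        tagged.put("reason", String.join("; ", reasons));
        return tagged;
    }

    /** Two illustrative rules (assumptions, not taken from the source). */
    public static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("missing share URL", r -> isBlank(r, "url")));
        rules.add(new Rule("missing timestamp", r -> isBlank(r, "timestamp")));
        return rules;
    }

    private static boolean isBlank(Map<String, String> record, String field) {
        String value = record.get(field);
        return value == null || value.trim().isEmpty();
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("timestamp", "1300000000000");
        System.out.println(apply(record, sampleRules()));
    }
}
```

Collecting every triggered rule, rather than stopping at the first, keeps the full explanation with the record, which is what makes the later review useful.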
Finally, a splitter SubAssembly was created to do two things. First, it allowed for the tuple stream to split into two: one stream for “good” data and one for “bad” data. Second, the splitter binned the data into intervals, such as every hour. To do this, only two operations were necessary: the first to create the interval from the timestamp value already present in the stream, and the second to use the interval and good/bad metadata to create a directory path (for example, 05/good/, where “05” is 5 a.m. and “good” means the Tuple passed all the rules). This path would then be used by the Cascading TemplateTap, a special Tap that can dynamically output tuple streams to different locations based on values in the Tuple. In this case, the TemplateTap used the “path” value to create the final output path.
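The two operations can be sketched as a pair of small functions. This is a plain-Java illustration, not the Cascading operations the developers wrote: the class SplitterPath and its method names are hypothetical, and the sketch assumes the timestamp is an epoch value in milliseconds interpreted in UTC, neither of which the text specifies.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical model of the splitter's two operations; no Cascading API is used.
final class SplitterPath {

    /**
     * First operation: derive the hourly interval ("00" to "23") from a timestamp.
     * Assumes epoch milliseconds and UTC; the source does not state either.
     */
    public static String hourInterval(long epochMillis) {
        int hour = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getHour();
        return String.format(Locale.ROOT, "%02d", hour);
    }

    /** Second operation: combine the interval and the good/bad flag into a path. */
    public static String outputPath(String interval, boolean good) {
        return interval + "/" + (good ? "good" : "bad") + "/";
    }

    public static void main(String[] args) {
        long fiveThirtyUtc = 19_800_000L; // 1970-01-01T05:30:00Z
        System.out.println(outputPath(hourInterval(fiveThirtyUtc), true));
    }
}
```

Because the path is computed as an ordinary field value, the routing decision lives in the data itself, and a sink that reads that field needs no branching logic of its own.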
The developers also created a fourth SubAssembly, this one to apply Cascading Assertions during unit testing. These assertions double-checked that the rules and parsing subassemblies did their job.
The unit test in Example 24-4 does not exercise the splitter; it is covered by a separate integration test, not shown here.
Example 24-4. Unit testing a Flow
public void testLogParsing() throws IOException {
  Hfs source = new Hfs(new TextLine(new Fields("line")), sampleData);
  Hfs sink = new Hfs(new TextLine(), outputPath + "/parser", SinkMode.REPLACE);

  Pipe pipe = new Pipe("parser");