tions (Functions, Filters, etc.) could be written and tested independently. Second,
the application was segmented into stages: one for parsing, one for rules, and a final stage
for binning/collating the data, all via the SubAssembly base class described earlier.
The data coming from the ShareThis loggers looks a lot like Apache logs, with
date/timestamps, share URLs, referrer URLs, and a bit of metadata. To use the data for
analysis downstream, the URLs needed to be unpacked (parsing query-string data, domain
names, etc.). So, a top-level SubAssembly was created to encapsulate the parsing, and child
subassemblies were nested inside to handle specific fields if they were sufficiently
complex to parse.
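The kind of URL unpacking described above can be sketched in plain Java. This is only an illustration, not the ShareThis code and not a Cascading operation: the class name UrlUnpacker, its method names, and the example URL in the usage note are all hypothetical, and the sketch assumes well-formed URLs (URI.create throws IllegalArgumentException otherwise).

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: pulls the domain name and query-string pairs out of a URL.
final class UrlUnpacker {

    /** Returns the host (domain name) of the URL, or null if it has none. */
    static String domain(String url) {
        return URI.create(url).getHost();
    }

    /** Splits the query string into decoded key/value pairs, in order of appearance. */
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            // A key with no '=' is kept with an empty value.
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    /** Percent-decodes one query-string component as UTF-8. */
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

For a made-up share URL such as http://example.com/share?u=http%3A%2F%2Fnews.example.org%2Fa&src=tw, domain would yield the host and queryParams would yield the decoded "u" and "src" values. In a Cascading application, logic like this would presumably sit inside a Function within the parsing SubAssembly.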
The same was done for applying rules. As every Tuple passed through the rules
SubAssembly, it was marked as “bad” if any of the rules were triggered. Along with the
“bad” tag, a description of why the record was bad was added to the Tuple for later
review.
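The tagging behavior described above can be sketched in plain Java, with a Map standing in for a Tuple. This is a minimal illustration under stated assumptions, not the actual rules SubAssembly: the RuleTagger and Rule names and the "status" and "reason" field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: marks a record "bad" if any rule is triggered
// and records why, so that the reason can be reviewed later.
final class RuleTagger {

    /** A described check; the predicate returns true when the rule is triggered. */
    static final class Rule {
        final String description;
        final Predicate<Map<String, String>> triggered;

        Rule(String description, Predicate<Map<String, String>> triggered) {
            this.description = description;
            this.triggered = triggered;
        }
    }

    /** Returns a copy of the record with "status" and "reason" fields added. */
    static Map<String, String> tag(Map<String, String> record, List<Rule> rules) {
        List<String> reasons = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.triggered.test(record)) {
                reasons.add(rule.description);
            }
        }
        Map<String, String> tagged = new LinkedHashMap<>(record);
        tagged.put("status", reasons.isEmpty() ? "good" : "bad");
        // The reason is empty for good records; multiple reasons are joined.
        tagged.put("reason", String.join("; ", reasons));
        return tagged;
    }
}
```

Evaluating every rule rather than stopping at the first hit is a design choice in this sketch: it keeps all the reasons a record failed, which suits the "later review" use described in the text.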
Finally, a splitter SubAssembly was created to do two things. First, it allowed for the
tuple stream to split into two: one stream for “good” data and one for “bad” data. Second,
the splitter binned the data into intervals, such as every hour. To do this, only two
operations were necessary: the first to create the interval from the timestamp value already
present in the stream, and the second to use the interval and good/bad metadata to create a
directory path (for example, 05/good/, where “05” is 5 a.m. and “good” means the Tuple
passed all the rules). This path would then be used by the Cascading TemplateTap, a
special Tap that can dynamically output tuple streams to different locations based on
values in the Tuple. In this case, the TemplateTap used the “path” value to create the
final output path.
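The two binning operations described above can be sketched in plain Java. This is an illustration only: it does not use TemplateTap or any Cascading class, the PathBinner name and its methods are hypothetical, and it assumes the timestamp is available as epoch milliseconds interpreted in UTC, which the text does not specify.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical sketch of the splitter's two operations.
final class PathBinner {

    /** Operation 1: derive the hourly interval ("00" to "23") from a timestamp. */
    static String hourInterval(long epochMillis) {
        int hour = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getHour();
        // Zero-pad so that 5 a.m. becomes "05"; Locale.ROOT keeps ASCII digits.
        return String.format(Locale.ROOT, "%02d", hour);
    }

    /** Operation 2: combine the interval and the good/bad flag into a relative path. */
    static String binPath(String interval, boolean good) {
        return interval + "/" + (good ? "good" : "bad") + "/";
    }
}
```

For a record stamped 05:30 UTC that passed every rule, binPath(hourInterval(ts), true) gives 05/good/, matching the example in the text. In the real application this value would be carried in the Tuple as the “path” field for the TemplateTap to consume.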
The developers also created a fourth SubAssembly, this one to apply Cascading
Assertions during unit testing. These assertions double-checked that the rules and parsing
subassemblies did their job.
In the unit test in Example 24-4, we see the splitter isn't being tested, but it is added in
another integration test not shown.
Example 24-4. Unit testing a Flow
public void testLogParsing() throws IOException
{
  Hfs source = new Hfs(new TextLine(new Fields("line")), sampleData);
  Hfs sink =
    new Hfs(new TextLine(), outputPath + "/parser", SinkMode.REPLACE);
  Pipe pipe = new Pipe("parser");