• Each line is split into tokens, represented by the ?word-dirty variable.
• A composition c/comp performs a string trim and converts the token represented
by ?word to lowercase.
• The stop data filters out matched tokens, implying a left join.
• An aggregator c/count counts each token, represented by ?count.
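As a rough illustration of the steps listed above, here is a minimal sketch in plain Python rather than Cascalog. The sample lines, the stop list, the token-splitting regular expression, and the names split and word_count are all assumptions made for illustration; none of them come from the book's code.

```python
import re
from collections import Counter

# Hypothetical in-memory stand-ins for the rain and stop data sets.
lines = ["A rain shadow is a dry area", "The rain is a Shadow ."]
stop_words = {"a", "is", "the"}


def split(line):
    """Split a line into raw tokens (the role played by ?word-dirty)."""
    # The delimiter set here is an assumption, not the book's regex.
    return [t for t in re.split(r"[\[\](),.\s]+", line) if t]


def word_count(lines, stop_words):
    """Trim and lowercase each token, drop stop words, count the rest."""
    counts = Counter()
    for line in lines:
        for word_dirty in split(line):
            # Analogous to composing a trim with a lowercase conversion.
            word = word_dirty.strip().lower()
            # Tokens that match the stop list are filtered out.
            if word and word not in stop_words:
                counts[word] += 1  # analogous to the count aggregator
    return dict(counts)


result = word_count(lines, stop_words)
print(result)
```

Note that this sketch spells out the loops and the filter step by step, which is exactly the "how" that the Cascalog query leaves to the planner.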
It's interesting that the Cascalog code for the Replicated Joins example is actually longer
than its Scalding equivalent. Even so, in Scalding much more of the “how” (the imperative
programming aspects) must be articulated. For example, the join, aggregation, and filters
in the Scalding version are more explicit. Also, to be fair, writing those Scalding examples
took some effort, since the approaches had to conform to Scala requirements for the pipes.
Figure 5-1 shows the conceptual flow diagram for “Example 4: Replicated Joins”. Note
that here in the Cascalog version, there is no “pipeline” per se. The workflow is exactly
the definition of the main function. Whereas the Scalding code provides an almost pure
expression of the Cascading flow, the Cascalog version expresses the desired end goal
of the workflow with less imperative “controls” defined. For example, the GroupBy is not
needed. Again, in Cascalog you specify what is required, not how it must be achieved.
To build:
$ lein clean
$ lein uberjar
Created /Users/ceteri/opt/Impatient/part4/target/impatient.jar
To run:
$ rm -rf output
$ hadoop jar ./target/impatient.jar data/rain.txt output/wc data/en.stop
To verify:
$ cat output/wc/part-00000
The results should be the same as in the Cascading version (“Example 4: Replicated
Joins”).
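One way to check that two runs agree, regardless of row order, is to load each output into a dictionary and compare. The following is a minimal Python sketch. It assumes each output row is a tab-delimited word and count, and that one of the files may start with a header row. The function names load_counts and same_results are hypothetical.

```python
def load_counts(text):
    """Parse tab-delimited 'word<TAB>count' rows into a dict."""
    counts = {}
    for row in text.splitlines():
        fields = row.split("\t")
        # Skip blank rows and any header row (assumed to have a
        # non-numeric second column).
        if len(fields) != 2 or not fields[1].strip().isdigit():
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def same_results(a_text, b_text):
    """Return True when both outputs hold the same word counts."""
    return load_counts(a_text) == load_counts(b_text)


# Hypothetical sample contents standing in for two part-00000 files.
cascalog_out = "rain\t2\nshadow\t1\n"
cascading_out = "token\tcount\nshadow\t1\nrain\t2\n"
print(same_results(cascalog_out, cascading_out))
```

To use this on real files, read the contents of output/wc/part-00000 and of the corresponding file from the other run, then pass both strings to same_results.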