Cascalog—A Clojure DSL for Cascading - Enterprise Data Workflows with Cascading

• Each line is split into tokens, represented by the ?word-dirty variable.

• A composition c/comp performs a string trim and converts the token represented by ?word to lowercase.

• The stop data filters out matched tokens, implying a left join.

• An aggregator c/count counts each token, represented by ?count.
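
The four steps above can be sketched in plain Clojure, without Cascalog or Hadoop, to make the data flow concrete. This is only an illustration under stated assumptions, not the book's code: the names split-tokens, clean-token, and word-count are hypothetical, the tokenizing regex is a simplified guess, and an in-memory set stands in for the stop-word file.

```clojure
(require '[clojure.string :as s])

;; Hypothetical helper: split a line into raw tokens (the role of ?word-dirty).
;; The regex is an assumption, chosen only for illustration.
(defn split-tokens [line]
  (s/split line #"[\s,.()\[\]]+"))

;; Composition of two string functions: trim first, then lowercase
;; (the role of the c/comp step that yields ?word).
(def clean-token (comp s/lower-case s/trim))

;; Hypothetical driver: tokenize, clean, drop stop words, count.
(defn word-count [lines stop-words]
  (->> lines
       (mapcat split-tokens)   ; one token per element
       (map clean-token)       ; normalize each token
       (remove s/blank?)       ; drop empty strings left by the split
       (remove stop-words)     ; a set acts as a membership predicate
       frequencies))           ; token -> count

(word-count ["A rain shadow is a dry area" "The rain, the RAIN"]
            #{"a" "the" "is"})
```

In this sketch every step is spelled out as a pipeline stage; the Cascalog query instead states the same constraints as predicates and leaves the planning to the library.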

It's interesting that the Cascalog code for the Replicated Joins example is actually longer than its Scalding equivalent. Even so, in Scalding much more of the “how” (the imperative programming aspects) must be articulated. For example, the join, aggregation, and filters in the Scalding version are more explicit. Also, to be fair, writing those Scalding examples took some effort to find approaches that conformed to Scala requirements for the pipes.

Figure 5-1 shows the conceptual flow diagram for “Example 4: Replicated Joins”. Note that here in the Cascalog version, there is no “pipeline” per se. The workflow is exactly the definition of the main function. Whereas the Scalding code provides an almost pure expression of the Cascading flow, the Cascalog version expresses the desired end goal of the workflow with less imperative “controls” defined. For example, the GroupBy is not needed. Again, in Cascalog you specify what is required, not how it must be achieved.
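
As a loose analogy for the “what, not how” point, the following plain-Clojure sketch counts words twice: once with an explicit grouping step, once by naming only the desired result. It is an assumption-laden illustration, not Cascalog or Cascading code; both function names are hypothetical.

```clojure
;; Analogy only: plain Clojure collections, no Cascalog involved.
(def words ["rain" "shadow" "rain"])

;; "How": group the words explicitly, then count each group.
(defn count-with-explicit-grouping [ws]
  (into {}
        (map (fn [[word group]] [word (count group)]))
        (group-by identity ws)))

;; "What": ask for the counts; the grouping is implied.
(defn count-declaratively [ws]
  (frequencies ws))

(= (count-with-explicit-grouping words)
   (count-declaratively words))
```

Both functions return the same map; the second simply omits the grouping step, much as the Cascalog query omits an explicit GroupBy.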

To build:

$ lein clean

$ lein uberjar

Created /Users/ceteri/opt/Impatient/part4/target/impatient.jar

To run:

$ rm -rf output

$ hadoop jar ./target/impatient.jar data/rain.txt output/wc data/en.stop

To verify:

$ cat output/wc/part-00000

The results should be the same as in the Cascading version (“Example 4: Replicated Joins”).
