Example 4 in Cascalog: Replicated Joins
Next, let's review the Cascalog code for an app similar to the Cascading version in
“Example 4: Replicated Joins” on page 22. Starting from the “Impatient” source code
directory that you cloned with Git, change into the part4 subdirectory. Look at the code
in src/impatient/core.clj:
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out stop & args]
  (let [rain (hfs-delimited in :skip-header? true)
        stop (hfs-delimited stop :skip-header? true)]
    (?<- (hfs-delimited out)
         [?word ?count]
         (rain _ ?line)
         (split ?line :> ?word-dirty)
         ((c/comp s/trim s/lower-case) ?word-dirty :> ?word)
         (stop ?word :> false)
         (c/count ?count))))
Again, this begins with a namespace declaration, which serves as the target of compilation.
This namespace also imports the Clojure string library (denoted by an s/ prefix) plus the
Cascalog aggregator operations (denoted by a c/ prefix).
Next there is a defmapcatop macro that defines a split operation to split text lines into
a token output stream—effectively a generator. This is based on a regex function in the
Clojure string library.
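To see what the split operation emits for one line, the regex can be tried at a REPL without Cascalog. The following is a minimal sketch in plain Clojure; the tokenize function and the sample sentence are hypothetical and exist only for illustration, while the regex is copied from the listing above.

```clojure
(require '[clojure.string :as s])

;; Same delimiter pattern as the split operation in the listing:
;; brackets, backslash, parentheses, comma, period, and whitespace.
(def token-pattern #"[\[\]\\\(\),.)\s]+")

(defn tokenize
  "Hypothetical plain-Clojure stand-in for the split operation.
   Returns a vector of tokens for one line of text."
  [line]
  (s/split line token-pattern))

(tokenize "A rain shadow, is a dry area.")
;; => ["A" "rain" "shadow" "is" "a" "dry" "area"]
```

Because the pattern ends in +, a run of adjacent delimiters such as a comma followed by a space counts as a single separator. In the Cascalog version, defmapcatop turns each element of the returned vector into its own output tuple.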
Next there is the main definition, similar to “Example 1: Simplest Possible App in
Cascading”, which now includes a stop source tap identifier to read the stop words list:
• Define and run a query.
• Write output tuples to the out sink tap, in TSV format.
• Output tuple scheme has ?word and ?count fields.
• Generator from the rain source tap identifier, in TSV format.
• Input tuple scheme uses only the ?line field; the _ ignores the first field.
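The overall data flow of the query in the listing can be approximated in memory with ordinary Clojure sequences. This is only a sketch under stated assumptions, not how Cascalog executes the query: word-count, its lines argument, and its stop-words set are hypothetical names, and the set lookup stands in for the join against the stop source tap.

```clojure
(require '[clojure.string :as s])

(defn word-count
  "Hypothetical in-memory analogue of the query.
   lines      - a sequence of text lines (the ?line field)
   stop-words - a set of words to exclude (stands in for the stop tap)
   Returns a map from word to count."
  [lines stop-words]
  (->> lines
       ;; split each line into tokens, as the split operation does
       (mapcat #(s/split % #"[\[\]\\\(\),.)\s]+"))
       ;; normalize each token: lower-case, then trim whitespace
       (map (comp s/trim s/lower-case))
       ;; keep only the words that are not in the stop list
       (remove stop-words)
       ;; count occurrences per word, like c/count grouped by ?word
       frequencies))

(word-count ["A rain shadow is a dry area" "The rain falls"]
            #{"a" "is" "the"})
;; => {"rain" 2, "shadow" 1, "dry" 1, "area" 1, "falls" 1}
```

The sketch holds everything in one process, whereas the Cascalog query reads from and writes to taps and leaves planning of the join and the aggregation to the framework.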