Example 4 in Cascalog: Replicated Joins
Next, let's review the Cascalog code for an app similar to the Cascading version in “Example 4: Replicated Joins” on page 22. Starting from the “Impatient” source code directory that you cloned in Git, change into the part4 subdirectory. Look at the code in src/impatient/core.clj:

    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\]\\\(\),.)\s]+"))

    (defn -main [in out stop & args]
      (let [rain (hfs-delimited in :skip-header? true)
            stop (hfs-delimited stop :skip-header? true)]
        (?<- (hfs-delimited out)
             [?word ?count]
             (rain _ ?line)
             (split ?line :> ?word-dirty)
             ((c/comp s/trim s/lower-case) ?word-dirty :> ?word)
             (stop ?word :> false)
             (c/count ?count))))

Again, this begins with a namespace, which serves as the target of a compilation. This namespace also imports the Clojure string library (denoted by an s/ prefix) plus the Cascalog aggregator operations (denoted by a c/ prefix).
Next there is a defmapcatop macro that defines a split operation to split text lines into a token output stream, effectively a generator. This is based on a regex function in the Clojure string library.
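
To see what the split operation emits for a single line, the regex can be tried at a plain Clojure REPL without Cascalog. The following is a minimal sketch under stated assumptions: the names delimiters and split-line are hypothetical helpers introduced here for illustration, and the sample sentence is made up rather than taken from the book's data set.

```clojure
(require '[clojure.string :as s])

;; Same delimiter regex as in the split operation above:
;; brackets, backslash, parentheses, comma, period, and whitespace.
(def delimiters #"[\[\]\\\(\),.)\s]+")

;; Hypothetical stand-in for the body of the defmapcatop, written as an
;; ordinary function so it runs without Cascalog on the classpath.
(defn split-line [line]
  (s/split line delimiters))

;; Under defmapcatop, each element of the returned vector would be
;; emitted as its own output tuple.
(split-line "A rain shadow is a dry area.")
;; => ["A" "rain" "shadow" "is" "a" "dry" "area"]
```

Note that the tokens keep their original case ("A" rather than "a"), which is why the query pipes them through s/trim and s/lower-case before counting.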
Next there is the main definition, similar to “Example 1: Simplest Possible App in Cascading”, which now includes a stop source tap identifier to read the stop words list:
• Define and run a query.
• Write output tuples to the out sink tap, in TSV format.
• Output tuple scheme has ?word and ?count fields.
• Generator from the rain source tap identifier, in TSV format.
• Input tuple scheme uses only the ?line field; the _ ignores the first field.