are available to assist.
Example 1 in Cascalog: Simplest Possible App
The tutorial examples show Cascalog code snippets run in the interactive REPL. Next let's use lein to build a “fat jar” that can also run on an Apache Hadoop cluster.
Paul Lam of uSwitch has translated each of the “Impatient” series of Cascading apps into Cascalog, some of which are more expressive than the originals.
Change to a directory where you have space for downloads, and then use Git to clone the Cascalog version of “Impatient” on GitHub:
$ git clone git://github.com/Quantisan/Impatient.git
Change into the part1 subdirectory. Then we'll review an app in Cascalog for a distributed file copy, similar to “Example 1: Simplest Possible App in Cascading” on page 3:
$ cd Impatient/part1
Source is located in the src/impatient/core.clj file:
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:gen-class))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?doc ?line]
       ((hfs-delimited in :skip-header? true) ?doc ?line)))
The ns form declares a namespace for the app. Languages such as Java use packages and imports for similar reasons, but in general Clojure namespaces are more advanced. For example, they provide better features for avoiding naming collisions. Namespaces are also first-class constructs in Clojure, so they can be composed dynamically. In this example, the namespace imports the Cascalog API, plus additional definitions for Cascading taps, such as TextDelimited for TSV format.
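The namespace features described above can be tried at a plain Clojure REPL without Cascalog. The following is a minimal sketch that uses only clojure.core, clojure.string, and clojure.set; the namespace demo.scratch and the var greeting are hypothetical names chosen for illustration and do not appear in the Impatient code.

```clojure
;; Illustrative sketch only; demo.scratch and greeting are hypothetical names.
(require '[clojure.string :as str])   ; the alias keeps str/split from colliding with other names
(use '[clojure.set :only (union)])    ; refer a single name, much like :only (hfs-delimited)

;; Aliased and selectively referred functions are called like any others.
(def fields (str/split "doc01\tA rain shadow" #"\t"))
(def merged (union #{:doc} #{:line}))

;; Namespaces are ordinary objects: create one and add a var to it at run time.
(def scratch (create-ns 'demo.scratch))
(intern scratch 'greeting "hello")
(def looked-up @(ns-resolve scratch 'greeting))
```

Restricting a :use clause with :only, as core.clj does for hfs-delimited, keeps the rest of a library's names out of the current namespace, which is one way collisions are avoided.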
The -main function is analogous to the Main method in “Example 1: Simplest Possible App in Cascading”. It has arguments for the in source tap identifier and the out sink tap identifier, plus an args rest parameter that collects any additional command-line arguments. A query writes output in TSV format for each tuple of ?doc and ?line fields from the input tuple stream. Note the property :skip-header? set to true, which causes the source tap to skip headers in the input TSV data.
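To make the data flow concrete, here is a rough in-memory analogy in plain Clojure, with no Cascalog or Hadoop involved. It is a sketch under stated assumptions rather than how hfs-delimited is implemented: copy-rows and describe-args are hypothetical helper names, and the input is assumed to be a sequence of tab-separated strings whose first element may be a header row.

```clojure
;; Illustrative analogy only; copy-rows and describe-args are hypothetical names.
(require '[clojure.string :as str])

(defn copy-rows
  "Optionally drops the first (header) line, then splits each remaining
  line on tabs and keeps the first two fields as a [doc line] pair."
  [lines skip-header?]
  (->> (if skip-header? (rest lines) lines)
       (map #(str/split % #"\t"))
       (map (fn [[doc line]] [doc line]))))

(defn describe-args
  "Shows how a signature like [in out & args] binds its parameters:
  the first two are positional and any extras are collected into args."
  [in out & args]
  {:in in :out out :extra args})

(def sample ["doc_id\ttext"
             "doc01\tA rain shadow is a dry area"
             "doc02\tThis sinking, dry air produces a rain shadow"])

(def copied (copy-rows sample true))
(def bound  (describe-args "data/rain.txt" "output/rain" "--verbose"))
```

With skip-header? set to true the header pair is absent from copied; with false it would be passed through as if it were data, which is the situation the :skip-header? property avoids.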