To build:
$ lein clean
$ lein uberjar
Created /Users/ceteri/opt/Impatient/part6/target/impatient.jar
To run:
$ rm -rf output
$ hadoop jar target/impatient.jar data/rain.txt output/wc \
    data/en.stop output/tfidf
To verify the output:
$ cat output/trap/part-m-00001-00001
zoink
$ head output/tfidf/part-00000
doc02 0.22314355131420976 area
doc01 0.44628710262841953 area
doc03 0.22314355131420976 area
doc05 0.9162907318741551 australia
doc05 0.9162907318741551 broken
doc04 0.9162907318741551 california's
doc04 0.9162907318741551 cause
doc02 0.9162907318741551 cloudcover
doc04 0.9162907318741551 death
doc04 0.9162907318741551 deserts
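The weights in this listing can be sanity-checked by hand. The following is a minimal sketch, assuming the workflow computes TF-IDF as tf * ln(n-docs / (1 + df)); that formula is not shown in this section. The counts are inferred from the output rather than stated in it: five documents in total, with "area" occurring in three of them and twice in doc01. The function name tf-idf is hypothetical and is not part of impatient.core.

```clojure
;; Hypothetical helper, assuming the weighting tf * ln(n-docs / (1 + df)).
;; tf     = occurrences of the token in one document
;; df     = number of documents that contain the token
;; n-docs = total number of documents in the corpus
(defn tf-idf
  [tf df n-docs]
  (* tf (Math/log (/ (double n-docs) (+ 1.0 df)))))

;; "area" in doc02: assumed tf = 1, df = 3, n-docs = 5
(println (tf-idf 1 3 5)) ; ln(5/4), roughly 0.2231

;; "area" in doc01: assumed tf = 2, so the weight doubles
(println (tf-idf 2 3 5)) ; roughly 0.4463

;; "australia" in doc05: assumed tf = 1, df = 1
(println (tf-idf 1 1 5)) ; ln(5/2), roughly 0.9163
```

Under these assumptions the three values line up with the rows for "area" and "australia", which suggests that a token found in a single document gets the largest weight in this corpus.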
“Example 6 in Cascalog: TF-IDF with Testing” also includes unit tests, with source code in the test/impatient/core_test.clj file:
(ns impatient.core-test
  (:use impatient.core
        clojure.test
        cascalog.api
        [midje sweet cascalog]))

(deftest scrub-text-test
  (fact
    (scrub-text "FoO BAR ") => "foo bar"))

(deftest etl-docs-gen-test
  (let [rain [["doc1" "a b c"]]
        stop [["b"]]]
    (fact
      (etl-docs-gen rain stop) => (produces [["doc1" "a"]
                                             ["doc1" "c"]]))))
Note the reference to midje in the namespace. These tests are based on a test framework called Midje-Cascalog, described by Ritchie on his GitHub project and in substantially more detail in his article about best practices for Cascalog testing.
Midje enables you to test Cascalog queries as functions, whether they stand alone or form part of a workflow. Each test definition shown in the preceding code uses
fact