(defn word-count
  "simple word count across all documents"
  [src]
  (<- [?word ?count]
      (src _ ?word)
      (c/count ?count)))

Next we have functions for the three branches, “D,” “DF,” and “TF.” Note that in Cascalog
a branch is defined as a function. To some extent, this reinforces the concept of closures
in functional programming, and does so far more naturally than Java allows.
A similar construct was also leveraged in the failure trap used in the stream assertion,
for the etl-docs-gen subquery. In Cascading, branch names get propagated through
a pipe assembly, then used in a flow definition to bind failure traps, so the specification
of a failure trap gets dispersed through different portions of a Cascading app. In contrast,
Cascalog has branches and traps specified concisely within a function definition, as
first-class language constructs.

(defn D [src]
  (let [src (select-fields src ["?doc-id"])]
    (<- [?n-docs]
        (src ?doc-id)
        (c/distinct-count ?doc-id :> ?n-docs))))

(defn DF [src]
  (<- [?df-word ?df-count]
      (src ?doc-id ?df-word)
      (c/distinct-count ?doc-id ?df-word :> ?df-count)))

(defn TF [src]
  (<- [?doc-id ?tf-word ?tf-count]
      (src ?doc-id ?tf-word)
      (c/count ?tf-count)))

Note the use of another Cascalog aggregator, the c/distinct-count function. This is
equivalent to the Unique filter in Cascading.
Next we construct two definitions to calculate TF-IDF. The first is the actual formula,
which shows how to use math functions. It also uses the Clojure thread-last macro ->>
to chain the steps of the calculation, passing each intermediate result as the last
argument of the next form.
The second definition is the function for the “TF-IDF” branch, which implies the joins
needed for the “D,” “DF,” and “TF” branches. It uses the ??- operator to execute the
“D” subquery and hold its result in memory.

(defn tf-idf-formula [tf-count df-count n-docs]
  (->> (+ 1.0 df-count)
       (div n-docs)
       (Math/log)
       (* tf-count)))

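
Because ->> passes each result as the last argument of the next form, the body of tf-idf-formula expands to (* tf-count (Math/log (div n-docs (+ 1.0 df-count)))). Written as conventional math, that would look roughly as follows. The symbols tf, df, and N are shorthand introduced here for tf-count, df-count, and n-docs, and the numbers in the second line are made up purely for illustration.

```latex
% tf = tf-count, df = df-count, N = n-docs (shorthand assumed for this sketch)
\[
  \text{tf-idf} \;=\; \mathit{tf} \cdot \ln\!\left( \frac{N}{1 + \mathit{df}} \right)
\]

% Illustrative values only: tf = 2, df = 1, N = 4
\[
  2 \cdot \ln\!\left( \frac{4}{1 + 1} \right) \;=\; 2 \ln 2 \;\approx\; 1.386
\]
```

Math/log is the natural logarithm. The 1.0 added to df-count keeps the denominator nonzero and forces floating-point arithmetic.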
(defn TF-IDF [src]
  (let [n-doc (first (flatten (??- (D src))))]