(defn word-count
  "simple word count across all documents"
  [src]
  (<- [?word ?count]
      (src _ ?word)
      (c/count ?count)))
Next we have functions for the three branches, “D,” “DF,” and “TF.” Note that in Cascalog
a branch is defined as a function. To some extent, this reinforces the concept of
closures in functional programming, and expresses it far more directly than Java could.
A similar construct was also leveraged in the failure trap used in the stream assertion,
for the etl-docs-gen subquery. In Cascading, branch names get propagated through
a pipe assembly, then used in a flow definition to bind failure traps, so the specification
of a failure trap gets dispersed through different portions of a Cascading app. In contrast,
Cascalog has branches and traps specified concisely within a function definition, as
first-class language constructs.
(defn D [src]
  (let [src (select-fields src ["?doc-id"])]
    (<- [?n-docs]
        (src ?doc-id)
        (c/distinct-count ?doc-id :> ?n-docs))))

(defn DF [src]
  (<- [?df-word ?df-count]
      (src ?doc-id ?df-word)
      (c/distinct-count ?doc-id ?df-word :> ?df-count)))

(defn TF [src]
  (<- [?doc-id ?tf-word ?tf-count]
      (src ?doc-id ?tf-word)
      (c/count ?tf-count)))
Note the use of another Cascalog aggregator, the c/distinct-count function. It counts
only the distinct values within each group, which is equivalent to applying Unique in
Cascading and then counting the result.
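To make the three branches concrete, here is a minimal sketch in plain Clojure rather than Cascalog. It assumes a small in-memory vector of [doc-id word] tuples standing in for src; the names tuples, n-docs, df, and tf are hypothetical and chosen only for illustration. It shows the difference between a distinct count and a plain count, not how Cascalog plans or runs the queries.

```clojure
;; Hypothetical sample data: [doc-id word] tuples standing in for `src`.
(def tuples
  [["doc01" "rain"] ["doc01" "rain"] ["doc01" "shadow"]
   ["doc02" "rain"] ["doc03" "lake"]])

;; "D" branch: the number of distinct documents.
(def n-docs
  (count (distinct (map first tuples))))
;; => 3

;; "DF" branch: for each word, the number of distinct documents containing it.
;; Repeated [doc-id word] pairs are dropped before counting.
(def df
  (->> (distinct tuples)
       (map second)
       (frequencies)))
;; => {"rain" 2, "shadow" 1, "lake" 1}

;; "TF" branch: for each [doc-id word] pair, the number of occurrences.
;; No de-duplication here, which is the difference from "DF".
(def tf
  (frequencies tuples))
;; (get tf ["doc01" "rain"]) => 2
```

The word "rain" occurs twice in doc01, so it contributes 2 to the term count for that document but only 1 to its document frequency.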
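Next we construct two definitions to calculate TF-IDF. The first is the actual formula,
which shows how to use math functions. It uses the Clojure threading operator ->>,
which passes the result of each form as the last argument of the next one, so the
definition computes tf-count * log(n-docs / (1.0 + df-count)).
The second definition is the function for the “TF-IDF” branch, which implies the joins
needed for the “D,” “DF,” and “TF” branches. It runs the “D” subquery with the ??-
operator, which executes a query and returns its results in memory.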
(defn tf-idf-formula [tf-count df-count n-docs]
  (->> (+ 1.0 df-count)
       (div n-docs)
       (Math/log)
       (* tf-count)))
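As a rough illustration of how the threading operator reads, the sketch below writes the same calculation twice in plain Clojure, once threaded and once nested. The local div is a stand-in defined here on the assumption that the div in the listing behaves as ordinary floating-point division; tf-idf-threaded and tf-idf-nested are hypothetical names.

```clojure
;; Stand-in for the `div` used in the listing. Assumption: it is treated
;; here as plain floating-point division.
(defn div [a b]
  (/ (double a) b))

;; Same shape as the listing: ->> passes each result as the LAST argument
;; of the next form.
(defn tf-idf-threaded [tf-count df-count n-docs]
  (->> (+ 1.0 df-count)
       (div n-docs)
       (Math/log)
       (* tf-count)))

;; The nested expression that the threaded form expands to.
(defn tf-idf-nested [tf-count df-count n-docs]
  (* tf-count
     (Math/log (div n-docs (+ 1.0 df-count)))))

;; With tf-count 2, df-count 1, and n-docs 4, both compute 2 * log(4 / 2).
(def example-threaded (tf-idf-threaded 2 1 4))
(def example-nested (tf-idf-nested 2 1 4))
```

Reading the threaded form top to bottom gives the order of evaluation directly, which is the main reason to prefer it over the nested version.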
(defn TF-IDF [src]
  (let [n-doc (first (flatten (??- (D src))))]