Cascalog—A Clojure DSL for Cascading - Enterprise Data Workflows with Cascading

(defn word-count [src]
  "simple word count across all documents"
  (<- [?word ?count]
      (src _ ?word)
      (c/count ?count)))

Next we have functions for the three branches, “D,” “DF,” and “TF.” Note that in Cascalog a branch is defined as a function. To some extent this reinforces the concept of closures in functional programming, and it is expressed far more directly than it could be in Java. A similar construct was also used for the failure trap in the stream assertion for the etl-docs-gen subquery. In Cascading, branch names get propagated through a pipe assembly and are then used in a flow definition to bind failure traps, so the specification of a failure trap is scattered across different portions of a Cascading app. In contrast, Cascalog specifies branches and traps concisely within a function definition, as first-class language constructs.

(defn D [src]
  (let [src (select-fields src ["?doc-id"])]
    (<- [?n-docs]
        (src ?doc-id)
        (c/distinct-count ?doc-id :> ?n-docs))))

(defn DF [src]
  (<- [?df-word ?df-count]
      (src ?doc-id ?df-word)
      (c/distinct-count ?doc-id ?df-word :> ?df-count)))

(defn TF [src]
  (<- [?doc-id ?tf-word ?tf-count]
      (src ?doc-id ?tf-word)
      (c/count ?tf-count)))

Note the use of another Cascalog aggregator, the c/distinct-count function, which counts the distinct values of its input fields within each group. This is roughly equivalent to the Unique filter in Cascading.
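To make the three aggregations concrete, here is a minimal plain-Clojure sketch of what “D,” “DF,” and “TF” compute, run over a small in-memory vector instead of a Cascalog source. The sample data and the names sample-tuples, n-docs, doc-freq, and term-freq are assumptions made for illustration; this is not how Cascalog evaluates the queries, only what the results mean.

```clojure
;; Hypothetical stand-in for the [doc-id word] tuples that `src` would emit.
(def sample-tuples
  [["doc1" "rain"] ["doc1" "rain"] ["doc1" "shadow"]
   ["doc2" "rain"] ["doc3" "lee"]])

(defn n-docs
  "D: the number of distinct documents."
  [tuples]
  (count (distinct (map first tuples))))

(defn doc-freq
  "DF: for each word, the number of distinct documents that contain it."
  [tuples]
  ;; Drop duplicate [doc-id word] pairs first, then count pairs per word.
  (frequencies (map second (distinct tuples))))

(defn term-freq
  "TF: for each [doc-id word] pair, the number of occurrences."
  [tuples]
  (frequencies tuples))

(n-docs sample-tuples)    ; distinct doc-ids
(doc-freq sample-tuples)  ; map of word -> distinct document count
(term-freq sample-tuples) ; map of [doc-id word] -> occurrence count
```

The step that removes duplicate pairs before counting in doc-freq is the part that c/distinct-count handles in the DF query, whereas term-freq keeps duplicates, as c/count does in TF.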

Next we construct two definitions to calculate TF-IDF. The first is the actual formula, which shows how to use math functions. It also uses the Clojure thread-last macro ->>, which passes the result of each form as the last argument of the next, so the computation reads from top to bottom.

The second definition is the function for the “TF-IDF” branch, which implies the joins needed for the “D,” “DF,” and “TF” branches. It uses the ??- operator to execute the D subquery and hold its result in memory.
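Written as an equation, the formula in the listing that follows appears to be the one below. The symbols are assumptions chosen to mirror the arguments tf-count, df-count, and n-docs; the logarithm is natural, since Math/log is the natural log.

```latex
\[
  \mathrm{tfidf}(w, d) \;=\; \mathit{tf}(w, d) \cdot
    \ln\!\left( \frac{N}{1 + \mathit{df}(w)} \right)
\]
% tf(w, d): occurrences of word w in document d   (tf-count)
% df(w):    number of documents containing w      (df-count)
% N:        total number of documents             (n-docs)
```

For instance, with tf = 3, df = 1, and N = 4, the value is 3 ln(4/2) = 3 ln 2, or about 2.079.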

(defn tf-idf-formula [tf-count df-count n-docs]
  (->> (+ 1.0 df-count)
       (div n-docs)
       (Math/log)
       (* tf-count)))
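To see what the ->> threading in tf-idf-formula does, the following plain-Clojure sketch compares the threaded form with its nested equivalent. The local div here is a hypothetical stand-in (simple floating-point division) so that the example runs without Cascalog; the names tf-idf-threaded and tf-idf-nested are likewise illustrative.

```clojure
(defn div
  "Hypothetical stand-in for the `div` in the listing: floating-point division."
  [x y]
  (/ (double x) (double y)))

(defn tf-idf-threaded
  "Same shape as tf-idf-formula: each result becomes the LAST argument
  of the next form."
  [tf-count df-count n-docs]
  (->> (+ 1.0 df-count)
       (div n-docs)
       (Math/log)
       (* tf-count)))

(defn tf-idf-nested
  "What the threaded form expands to, written inside-out."
  [tf-count df-count n-docs]
  (* tf-count (Math/log (div n-docs (+ 1.0 df-count)))))

;; Both forms compute tf * ln(n / (1 + df)).
(= (tf-idf-threaded 3 1 4) (tf-idf-nested 3 1 4))
```

Because ->> places each value in the last argument position, (div n-docs) becomes (div n-docs (+ 1.0 df-count)), which is why n-docs ends up as the numerator.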

(defn TF-IDF [src]
  (let [n-doc (first (flatten (??- (D src))))]
