Distributed Data Processing with Cascalog - Clojure Data Analysis


Aggregating data with Cascalog

So far, the Cascalog queries you saw have all returned tables of results. However, sometimes you'll want to aggregate the tables in order to boil them down to a single value or into a table where groups from the original data are aggregated.
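To make the two kinds of aggregation concrete, here is a minimal sketch in plain Clojure (no Cascalog involved). The flights rows and the passenger numbers are hypothetical sample data, not taken from the recipe's CSV file; they only illustrate the difference between reducing a table to a single value and reducing it to one row per group.

```clojure
;; Hypothetical sample rows: [origin-airport passengers].
;; These values are made up for illustration only.
(def flights
  [["JFK" 10]
   ["LAX" 5]
   ["JFK" 20]])

;; Aggregate the whole table to a single value:
;; how many distinct origin airports are there?
(def airport-count
  (count (distinct (map first flights))))

;; Aggregate per group: total passengers for each origin airport.
(def passengers-by-origin
  (reduce (fn [totals [origin passengers]]
            (update totals origin (fnil + 0) passengers))
          {}
          flights))

(prn airport-count)        ; 2
(prn passengers-by-origin) ; {"JFK" 30, "LAX" 5}
```

The first result is a single number, while the second keeps one entry per group.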

Cascalog also makes this easy to do, and it includes a number of aggregate functions. For this recipe, we'll only use two, cascalog.logic.ops/distinct-count and cascalog.logic.ops/sum, but you can easily find more in the API documentation on the Cascalog website (http://nathanmarz.github.io/cascalog/cascalog.logic.ops.html).

Getting ready

We'll use the same dependencies and imports as we did in Parsing CSV Files with Cascalog. We'll also use the same data that we defined in that recipe.

How to do it…

We'll take a look at a couple of examples of how to aggregate data with the count function:

1. First, we'll query how many distinct origin airports the data contains:

user=> (?<- (stdout)
            [?count]
            ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                             :has-header true)
             ?origin_airport _ _ _ _)
            (:distinct true)
            (c/distinct-count ?origin_airport :> ?count))
…
RESULTS
-----------------------
683
-----------------------

For this, we specify that we want distinct results for entire rows (the default). Then we simply include the aggregate operator as a predicate and bind its result to a new name (?count). We use this binding, and only this binding, in the results. The other predicates in the query select the data that we want aggregated.
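The same pattern can be tried without the CSV file. The following is a sketch only, assuming Cascalog 2.x is on the classpath and that an in-memory vector of tuples is used as the generator. The flights rows and passenger numbers are hypothetical, and ??<- is used instead of ?<- so that the results come back as a sequence rather than being printed to a sink.

```clojure
(require '[cascalog.api :refer :all]
         '[cascalog.logic.ops :as c])

;; Hypothetical in-memory rows: [origin-airport passengers].
;; These stand in for the recipe's CSV data and are made up.
(def flights
  [["JFK" 10]
   ["LAX" 5]
   ["JFK" 20]])

;; Single value: only the aggregate's binding (?count) is in the
;; output, so the whole input collapses to one row.
(def airport-count
  (??<- [?count]
        (flights ?origin _)
        (c/distinct-count ?origin :> ?count)))

;; Grouped: ?origin is also in the output, so Cascalog computes
;; one sum per distinct origin airport.
(def passengers-by-origin
  (??<- [?origin ?total]
        (flights ?origin ?passengers)
        (c/sum ?passengers :> ?total)))
```

Listing a non-aggregated variable in the output vector is what turns a single-value aggregate into a grouped one; the order of the grouped rows should not be relied on.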
