Distributed Data Processing with Cascalog - Clojure Data Analysis

Composing Cascalog queries

One of the best things about Cascalog queries is that they can be composed together. Similar to composing functions, this can be a good way to build a complex process from smaller, easy-to-understand parts.
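As a rough illustration of what composition means here, the sketch below builds one query on top of another. It assumes Cascalog is on the classpath; the `people` data and the query names `person-age` and `under-30` are hypothetical and are not part of this recipe.

```clojure
(require '[cascalog.api :refer [<- ??<-]])

;; Hypothetical in-memory data: [name age] tuples used as a generator.
(def people [["alice" 28] ["bob" 33] ["carol" 19]])

;; A small query that simply exposes the fields.
(def person-age
  (<- [?name ?age]
      (people ?name ?age)))

;; A second query that uses the first one as its generator
;; and keeps only the people under 30.
(def under-30
  (<- [?name]
      (person-age ?name ?age)
      (< ?age 30)))

;; ??<- runs the query and returns the result tuples as a sequence.
(def young-names
  (set (map first (??<- [?name] (under-30 ?name)))))
```

Neither `person-age` nor `under-30` runs anything when it is defined. Work only happens when a query is executed, here by `??<-`.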

In this recipe, we'll parse the Virginia census data we first used in the Managing program complexity with STM recipe in Chapter 3, Managing Complexity with Concurrent Programming. You can download this data from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P35.csv. We'll also use a new census data file that contains the race data. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P3.csv.

Getting ready

Since we're reading CSV, we'll need to use the dependencies and imports from the Parsing CSV files with Cascalog recipe.

We'll also use the hfs-text-delim function from that recipe and ->long from the Aggregating data with Cascalog recipe.
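The definition of `->long` lives in the earlier recipe and is not repeated here. A minimal sketch, assuming it does nothing more than parse a string of decimal digits into a long, might look like this:

```clojure
;; Assumed definition; the version in the earlier recipe may differ.
(defn ->long
  "Parses a string of decimal digits into a long."
  [s]
  (Long/parseLong s))

(->long "2530") ; => 2530
```

Note that `Long/parseLong` throws a `NumberFormatException` on empty or non-numeric input, so a real dataset with missing values may need extra handling.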

Also, we'll need the data files from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P35.csv and http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P3.csv. We'll put them into the data directory and bind their paths to names, as follows:

(def families-file "data/all_160_in_51.P35.csv")

(def race-file "data/all_160_in_51.P3.csv")

How to do it…

We'll read these datasets and convert some of the fields in each to integers. Then we'll join the two together and select only a few of the fields.
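Before working with the real files, the overall shape of this plan (read, convert, join, select) can be sketched with small in-memory datasets. Everything in the sketch is an assumption made for illustration: the tuples, the field layout, the query names, and the local `->long` definition are hypothetical, and Cascalog is assumed to be on the classpath.

```clojure
(require '[cascalog.api :refer [<- ??<-]])

;; Assumed helper; the earlier recipe's version may differ.
(defn ->long [s] (Long/parseLong s))

;; Hypothetical stand-ins for the two CSV files, keyed by a place ID.
(def families [["001" "Place A" "2000"] ["002" "Place B" "150"]])
(def races    [["001" "8191"] ["002" "519"]])

;; Read each dataset and convert its numeric field from a string.
(def family-q
  (<- [?id ?name ?fam]
      (families ?id ?name ?fstr)
      (->long ?fstr :> ?fam)))

(def race-q
  (<- [?id ?total]
      (races ?id ?tstr)
      (->long ?tstr :> ?total)))

;; Using ?id in both generators joins them on that field.
;; The output vector selects only the fields we want to keep.
(def joined
  (<- [?name ?fam ?total]
      (family-q ?id ?name ?fam)
      (race-q ?id ?total)))

(def result
  (set (map vec (??<- [?name ?fam ?total]
                      (joined ?name ?fam ?total)))))
```

Because `?id` is a non-nullable variable shared by both generators, this behaves like an inner join: a place that appears in only one dataset would be dropped from the result.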

1. We'll define a query that reads the families data file and converts the integer fields to numbers:

(def family-data
  (<- [?GEOID ?SUMLEV ?STATE
       ?NAME ?POP100 ?HU100 ?P035001]
      ((hfs-text-delim families-file
