Improving Performance with Parallel Programming - Clojure Data Analysis

Parallelizing processing with Incanter

In the upcoming chapters, many recipes will feature Incanter. One of its strengths
is that it uses the Parallel Colt Java library (http://sourceforge.net/projects/parallelcolt/)
to handle its processing. So when you call its matrix, statistical, or other functions,
they're automatically executed on multiple threads.

For this, we'll revisit the Virginia housing-unit census data from the Managing program
complexity with STM recipe in Chapter 3, Managing Complexity with Concurrent
Programming. This time, we'll fit a linear regression to it.
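To make the goal concrete, here is a minimal sketch of what a linear regression with a single predictor computes, written in plain Clojure rather than with Incanter. The function names (arithmetic-mean, simple-linear-fit) and the toy numbers are assumptions made up for illustration. They are not part of the recipe or of Incanter's API, and the real census values will differ.

```clojure
;; Ordinary least squares for one predictor: y ~ intercept + slope * x.
;; All names and numbers here are hypothetical, for illustration only.

(defn arithmetic-mean
  "Average of a non-empty collection of numbers, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn simple-linear-fit
  "Returns {:slope b :intercept a} minimizing the squared error of y = a + b*x."
  [xs ys]
  (let [mx    (arithmetic-mean xs)
        my    (arithmetic-mean ys)
        ;; Sums of cross-deviations and squared deviations (the 1/n factors cancel).
        sxy   (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx   (reduce + (map (fn [x] (let [d (- x mx)] (* d d))) xs))
        slope (/ sxy sxx)]
    {:slope     slope
     :intercept (- my (* slope mx))}))

;; Toy data: housing units generated as 5 + 0.4 * population.
(def toy-population    [100.0 200.0 300.0 400.0])
(def toy-housing-units [45.0 85.0 125.0 165.0])

(def toy-fit (simple-linear-fit toy-population toy-housing-units))
```

With these made-up values, the fit recovers a slope of about 0.4 and an intercept of about 5, because the toy housing-unit counts were constructed that way. Incanter performs the equivalent computation on matrices.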

Getting ready

We need to add Incanter to the list of dependencies in our Leiningen project.clj file:

(defproject parallel-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We also need to pull these libraries into our REPL or script:

(use '(incanter core datasets io optimize charts stats))

We'll use the data file from the Managing program complexity with STM recipe in Chapter 3,
Managing Complexity with Concurrent Programming. We can bind that filename to the
name data-file, just as we did in that recipe:

(def data-file "data/all_160_in_51.P35.csv")

How to do it…

For this recipe, we'll extract the data to be analyzed and perform a linear regression. We'll

then graph the data.

1. First, we'll read in the data and pull the population and housing-unit columns into
   their own matrix:

(def data (to-matrix
           (sel (read-dataset data-file :header true)
                :cols [:POP100 :HU100])))

2. From this matrix, we can bind the population and the housing-unit data to their
   own names:

(def population (sel data :cols 0))

(def housing-units (sel data :cols 1))
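The two steps above can be pictured with a plain-Clojure analogue that needs neither Incanter nor the data file. This is only a sketch of the shape of the data, not how Incanter implements sel or to-matrix. The rows, the :LABEL column, and the toy-* names are hypothetical.

```clojure
;; Hypothetical rows standing in for the parsed CSV file.
;; Only :POP100 and :HU100 come from the recipe; :LABEL is made up.
(def toy-rows
  [{:LABEL "Place A" :POP100 100 :HU100 45}
   {:LABEL "Place B" :POP100 200 :HU100 85}
   {:LABEL "Place C" :POP100 300 :HU100 125}])

;; Step 1 analogue: keep only the two numeric columns, one row vector per place.
(def toy-matrix
  (mapv (juxt :POP100 :HU100) toy-rows))

;; Step 2 analogue: column 0 holds population, column 1 holds housing units.
(def toy-pop (mapv #(nth % 0) toy-matrix))
(def toy-hu  (mapv #(nth % 1) toy-matrix))
```

The column order in the matrix follows the order of the keys given when selecting, which is why index 0 is population and index 1 is housing units.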
