Improving Performance with Parallel Programming - Clojure Data Analysis


The data that we'll work on will be a sequence of strings that contain words and numbers.
We'll convert all of the letters to lowercase and all of the numbers to integers. Based on this
specification, the first step of the processing pipeline will be str/lower-case. The second
step will be the ->int function:

(defn ->int [x]
  (try
    (Long/parseLong x)
    (catch Exception e
      x)))
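As a quick check of how ->int behaves, here is a minimal sketch. The two sample inputs are illustrative choices and are not taken from the recipe's data:

```clojure
;; Parse a string as a long; if it is not a valid integer,
;; fall back to returning the original string unchanged.
(defn ->int [x]
  (try
    (Long/parseLong x)
    (catch Exception e
      x)))

(->int "42")     ; a numeric string is parsed to the long 42
(->int "items.") ; a non-numeric string comes back as "items."
```

Because the catch clause returns its input, ->int can be mapped over a mixed sequence of words and numbers without throwing.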

The data that we'll work on will be this list:

(def data
  (str/split (str "This is a small list. It contains 42 "
                  "items. Or less.")
             #"\s+"))

If you run this using clojure.core/map, you will get the results that you would expect:

user=> (map ->int
            (map str/lower-case
                 data))
("this" "is" "a" "small" "list." "it" "contains" 42 "items." "or"
 "less.")

The problem with this approach isn't the results; it's what Clojure is doing between the two
calls to map. In this case, the first map creates an entirely new lazy sequence. The second
map walks over it again before throwing it and its contents away. Repeatedly allocating lists
and immediately throwing them away is wasteful. It takes more time and can potentially
consume more memory than you have available. In this case, this isn't really a problem,
but for longer pipelines of map calls (potentially processing long sequences), it can
become a performance problem.
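To make the intermediate sequence concrete, here is a minimal sketch that names it explicitly and compares the result with a single pass over the data. The names lowered, two-pass, and one-pass are hypothetical, and composing the functions with comp is only an illustration of the cost being described; it is not the approach this recipe goes on to use.

```clojure
(require '[clojure.string :as str])

(defn ->int [x]
  (try
    (Long/parseLong x)
    (catch Exception e
      x)))

(def data
  (str/split (str "This is a small list. It contains 42 "
                  "items. Or less.")
             #"\s+"))

;; Two passes: the inner map produces a lazy sequence that exists
;; only to be consumed by the outer map.
(def lowered (map str/lower-case data)) ; the intermediate sequence
(def two-pass (map ->int lowered))

;; One pass (hypothetical alternative): compose the two functions
;; so that no intermediate sequence is allocated.
(def one-pass (map (comp ->int str/lower-case) data))

(= two-pass one-pass) ; both pipelines produce the same values
```

Each extra map in a pipeline adds another sequence like lowered, which is where the allocation overhead comes from.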

This is a problem that reducers address. Let's change our calls to map into calls to
clojure.core.reducers/map (aliased here as r) and see what happens:

user=> (r/map ->int
              (r/map str/lower-case
                     data))
#<reducers$folder$reify__1529 clojure.core.reducers$folder$reify__1529@37577fd6>
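The value printed above is a reducible object rather than a sequence, so nothing has been computed yet. One way to realize it, sketched here under the assumption that clojure.core.reducers is aliased as r and that ->int and data are defined as earlier in this section, is to pour it into a vector with into. The name pipeline is a hypothetical one used only for this example.

```clojure
(require '[clojure.string :as str]
         '[clojure.core.reducers :as r])

(defn ->int [x]
  (try
    (Long/parseLong x)
    (catch Exception e
      x)))

(def data
  (str/split (str "This is a small list. It contains 42 "
                  "items. Or less.")
             #"\s+"))

;; r/map does not build a sequence; it returns a recipe for a reduction.
(def pipeline
  (r/map ->int
         (r/map str/lower-case
                data)))

;; into is implemented with reduce, so it runs both mapping steps
;; over each item in a single pass and collects the results.
(into [] pipeline)
;; => ["this" "is" "a" "small" "list." "it" "contains" 42 "items." "or" "less."]
```

Any function built on reduce can consume pipeline in the same way.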
