There's more…
This approach to parsing dates has a number of problems. For example, because some date
formats are ambiguous, the first match might not be the correct one.
However, trying out a list of formats is probably about the best we can do. Knowing something
about our data allows us to prioritize the list appropriately, and we can augment it with ad
hoc formats as we run across new data. We might also need to normalize data from different
sources (for instance, U.S. date formats versus the rest of the world) before we merge the
data together.
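The "try a list of formats" strategy can be sketched in a few lines. This is a minimal illustration, not the book's implementation; the format strings and the function name `parse-date` are placeholders you would adjust for your own data. Note that `SimpleDateFormat` is lenient by default, which makes ambiguous matches even more likely, so we turn leniency off:

```clojure
(defn parse-date
  "Try each format in priority order and return the first successful parse,
  or nil if no format matches."
  [s]
  (let [formats ["yyyy-MM-dd" "MM/dd/yyyy" "dd/MM/yyyy"]]
    (first
     (keep (fn [fmt]
             (try
               (let [sdf (doto (java.text.SimpleDateFormat. fmt)
                           ;; Strict parsing: don't let "13/01/2012" sneak
                           ;; through the MM/dd/yyyy format.
                           (.setLenient false))]
                 (.parse sdf s))
               (catch java.text.ParseException _ nil)))
           formats))))
```

Because the formats are tried in order, putting the formats you trust most at the front of the list is how you encode what you know about your data.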
Lazily processing very large data sets
One of the good features of Clojure is that most of its sequence-processing functions are lazy.
This allows us to handle very large datasets with very little effort. However, when combined
with reading from files and other I/O, there are several things that you need to watch out for.
In this recipe, we'll take a look at several ways to safely and lazily read a CSV file. By default,
clojure.data.csv/read-csv is lazy, so how do you maintain this feature while
closing the file at the right time?
Getting ready
We'll use a project.clj file that includes a dependency on the Clojure CSV library:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [org.clojure/data.csv "0.1.2"]])
We need to load the libraries that we're going to use into the REPL:
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])
How to do it…
We'll try several solutions and consider their strengths and weaknesses:
1. Let's start with the most straightforward way:
   (defn lazy-read-bad-1 [csv-file]
     (with-open [in-file (io/reader csv-file)]
       (csv/read-csv in-file)))

   user=> (lazy-read-bad-1 "data/small-sample.csv")
   IOException Stream closed java.io.BufferedReader.ensureOpen
   (BufferedReader.java:97)