There's more…
This approach to parsing dates has a number of problems. For example, because some date
formats are ambiguous, the first match might not be the correct one.
However, trying out a list of formats is probably about the best we can do. Knowing something
about our data allows us to prioritize the list appropriately, and we can augment it with ad
hoc formats as we run across new data. We might also need to normalize data from different
sources (for instance, U.S. date formats versus the rest of the world) before we merge the
data together.
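The "try a list of formats" strategy can be sketched in a few lines. This is a minimal illustration, not the book's implementation; the format strings and the function name `parse-date` are placeholders you would adjust for your own data. Note that `SimpleDateFormat` is lenient by default, which makes ambiguous matches even more likely, so we turn leniency off:

```clojure
(defn parse-date
  "Try each format in priority order and return the first successful parse,
  or nil if no format matches."
  [s]
  (let [formats ["yyyy-MM-dd" "MM/dd/yyyy" "dd/MM/yyyy"]]
    (first
     (keep (fn [fmt]
             (try
               (let [sdf (doto (java.text.SimpleDateFormat. fmt)
                           ;; Strict parsing: don't let "13/01/2012" sneak
                           ;; through the MM/dd/yyyy format.
                           (.setLenient false))]
                 (.parse sdf s))
               (catch java.text.ParseException _ nil)))
           formats))))
```

Because the formats are tried in order, putting the formats you trust most at the front of the list is how you encode what you know about your data.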
Lazily processing very large data sets
One of the good features of Clojure is that most of its sequence-processing functions are lazy.
This allows us to handle very large datasets with very little effort. However, when combined
with reading from files and other I/O, there are several things that you need to watch out for.
In this recipe, we'll take a look at several ways to safely and lazily read a CSV file. By default,
clojure.data.csv/read-csv is lazy, so how do you maintain this feature while
closing the file at the right time?
Getting ready
We'll use a project.clj file that includes a dependency on the Clojure CSV library:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [org.clojure/data.csv "0.1.2"]])
We need to load the libraries that we're going to use into the REPL:
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])
How to do it…
We'll try several solutions and consider their strengths and weaknesses:
1. Let's start with the most straightforward way:
   (defn lazy-read-bad-1 [csv-file]
     (with-open [in-file (io/reader csv-file)]
       (csv/read-csv in-file)))

   user=> (lazy-read-bad-1 "data/small-sample.csv")
   IOException Stream closed java.io.BufferedReader.ensureOpen
   (BufferedReader.java:97)