How it works…
The last version of the function, lazy-read-csv , works because it takes the lazy sequence
that csv/read-csv produces and wraps it in another sequence that closes the input file
when there is no more data coming out of the CSV file. This is complicated because we're
working with two levels of input: reading from the file and reading CSV. When the higher-level
task (reading CSV) is completed, it triggers an operation on the lower level (reading the file).
This allows you to read files that don't fit into memory and process their data on the fly.
However, with this function, we again have a nice, simple interface that we can present to
callers while keeping the complexity hidden.
Unfortunately, this still has one glaring problem: if we're not going to read the entire file
(say, we're only interested in the first 100 lines or so), the file handle won't get closed.
For the use cases in which only a part of the file will be read, lazy-read-ok is probably the
best option.
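The close-on-exhaustion pattern described above can be sketched as follows. This is a minimal illustration, not the recipe's verbatim code: it wraps plain lines rather than CSV rows (csv/read-csv would slot in where line-seq is used), and the names lazy-read-lines and wrap are illustrative.

```clojure
(require '[clojure.java.io :as io])

(defn lazy-read-lines
  "Lazily reads lines from filename, closing the reader as soon as
  the wrapped sequence runs out of data."
  [filename]
  (let [reader (io/reader filename)
        ;; wrap rebuilds the sequence lazily; when the wrapped
        ;; sequence is exhausted, it closes the reader instead of
        ;; producing more items.
        wrap (fn wrap [s]
               (lazy-seq
                 (if-let [s (seq s)]
                   (cons (first s) (wrap (rest s)))
                   (.close reader))))]
    (wrap (line-seq reader))))
```

Note that the close only fires once the sequence is fully realized, which is exactly the weakness discussed above: a caller who stops early leaves the reader open.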
Sampling from very large data sets
One way to deal with very large data sets is to sample. This can be especially useful when
we're getting started and want to explore a data set. A good sample can tell us what's in the
full data set and what we'll need to do in order to clean and process it. This is the same
principle behind surveys and election exit polls.
In this recipe, we'll see a couple of ways of creating samples.
Getting ready
We'll use a basic project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
:dependencies [[org.clojure/clojure "1.6.0"]])
How to do it…
There are two ways to sample from a stream of values. If you want 10 percent of the larger
population, you can just take every tenth item. If you want 1,000 out of who knows how many
items, the process is a little more complicated.
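Both approaches can be sketched in a few lines. These are hedged illustrations under assumed names (sample-by-percent and sample-reservoir are not necessarily the recipe's own): the first keeps a fixed fraction of the stream, and the second uses reservoir sampling (Algorithm R) to pick an exact count from a stream of unknown length.

```clojure
(defn sample-by-percent
  "Keeps each item independently with probability p, so roughly a
  fraction p of the stream survives. (For a strict every-tenth-item
  sample, (take-nth 10 coll) works instead.)"
  [p coll]
  (filter (fn [_] (< (rand) p)) coll))

(defn sample-reservoir
  "Reservoir sampling: returns k items chosen uniformly at random
  from coll without knowing its length in advance."
  [k coll]
  (let [reservoir (vec (take k coll))]
    (first
      (reduce (fn [[res i] item]
                ;; The (inc i)-th item replaces a random slot with
                ;; probability k/(inc i), keeping the sample uniform.
                (let [j (rand-int (inc i))]
                  [(if (< j k) (assoc res j item) res)
                   (inc i)]))
              [reservoir k]
              (drop k coll)))))
```

The percentage version stays lazy, so it composes with the streaming readers above; the reservoir version must consume the whole stream, but it holds only k items in memory at any time.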