How it works…
The last version of the function, lazy-read-csv , works because it takes the lazy sequence
that csv/read-csv produces and wraps it in another sequence that closes the input file
when there is no more data coming out of the CSV file. This is complicated because we're
working with two levels of input: reading from the file and reading CSV. When the higher-level
task (reading CSV) is completed, it triggers an operation on the lower level (reading the file).
This allows you to read files that don't fit into memory and process their data on the fly.
However, with this function, we again have a nice, simple interface that we can present to
callers while keeping the complexity hidden.
Unfortunately, this still has one glaring problem: if we're not going to read the entire file
(say, we're only interested in the first 100 lines or so), the file handle won't get closed.
For the use cases in which only a part of the file will be read, lazy-read-ok is probably the
best option.
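The close-on-exhaustion pattern described above can be sketched as follows. This is a minimal illustration, not the recipe's verbatim code: it wraps plain lines rather than CSV rows (csv/read-csv would slot in where line-seq is used), and the names lazy-read-lines and wrap are illustrative.

```clojure
(require '[clojure.java.io :as io])

(defn lazy-read-lines
  "Lazily reads lines from filename, closing the reader as soon as
  the wrapped sequence runs out of data."
  [filename]
  (let [reader (io/reader filename)
        ;; wrap rebuilds the sequence lazily; when the wrapped
        ;; sequence is exhausted, it closes the reader instead of
        ;; producing more items.
        wrap (fn wrap [s]
               (lazy-seq
                 (if-let [s (seq s)]
                   (cons (first s) (wrap (rest s)))
                   (.close reader))))]
    (wrap (line-seq reader))))
```

Note that the close only fires once the sequence is fully realized, which is exactly the weakness discussed above: a caller who stops early leaves the reader open.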
Sampling from very large data sets
One way to deal with very large data sets is to sample. This can be especially useful when
we're getting started and want to explore a data set. A good sample can tell us what's in the
full data set and what we'll need to do in order to clean and process it. This is the same
principle behind surveys and election exit polls.
In this recipe, we'll see a couple of ways of creating samples.
Getting ready
We'll use a basic project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
:dependencies [[org.clojure/clojure "1.6.0"]])
How to do it…
There are two ways to sample from a stream of values. If you want 10 percent of the larger
population, you can just take every tenth item. If you want 1,000 out of who knows how many
items, the process is a little more complicated.
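Both approaches can be sketched in a few lines. These are hedged illustrations under assumed names (sample-by-percent and sample-reservoir are not necessarily the recipe's own): the first keeps a fixed fraction of the stream, and the second uses reservoir sampling (Algorithm R) to pick an exact count from a stream of unknown length.

```clojure
(defn sample-by-percent
  "Keeps each item independently with probability p, so roughly a
  fraction p of the stream survives. (For a strict every-tenth-item
  sample, (take-nth 10 coll) works instead.)"
  [p coll]
  (filter (fn [_] (< (rand) p)) coll))

(defn sample-reservoir
  "Reservoir sampling: returns k items chosen uniformly at random
  from coll without knowing its length in advance."
  [k coll]
  (let [reservoir (vec (take k coll))]
    (first
      (reduce (fn [[res i] item]
                ;; The (inc i)-th item replaces a random slot with
                ;; probability k/(inc i), keeping the sample uniform.
                (let [j (rand-int (inc i))]
                  [(if (< j k) (assoc res j item) res)
                   (inc i)]))
              [reservoir k]
              (drop k coll)))))
```

The percentage version stays lazy, so it composes with the streaming readers above; the reservoir version must consume the whole stream, but it holds only k items in memory at any time.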