This is often an iterative, interactive process. If it's a very large dataset, I might create a sample
to work with at this stage. Generally, I start by examining the data files. Once I find a problem, I
try to code a solution, which I run on the dataset. After each change, I archive the data, either
using a ZIP file or, if the data files are small enough, a version control system. Using a version
control system is a good option because I can track the code to transform the data along with
the data itself and I can also include comments about what I'm doing. Then, I take a look at
the data again, and the entire process starts again. Once I've moved on to analyze the entire
collection of data, I might find more issues or I might need to change the data somehow in order
to make it easier to analyze, and I'm back in the data cleansing loop once more.
Clojure is an excellent tool for this kind of work, because a REPL is a great environment to
explore data and fix it interactively. Also, because many of its sequence functions are lazy by
default, Clojure makes it easy to work with a lot of data.
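As a minimal sketch of what laziness buys you here, the following snippet cleans an unbounded stream of values but only does the work for the elements that are actually requested. The `clean-line` function and the sample string are hypothetical, made up purely for illustration.

```clojure
(require '[clojure.string :as string])

;; Hypothetical cleaning step: trim surrounding whitespace and lower-case.
(defn clean-line
  [line]
  (string/lower-case (string/trim line)))

;; `repeat` with one argument produces an infinite lazy sequence, and `map`
;; is lazy too, so defining `cleaned` does not process anything yet.
(def cleaned
  (map clean-line (repeat "  Some RAW value  ")))

;; Elements are cleaned only when requested; here we ask for three of them.
(def first-three (doall (take 3 cleaned)))
```

The same pattern applies when the input is a large file read line by line: as long as the pipeline stays lazy, the whole dataset does not have to be held in memory at once.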
This chapter will highlight a few of the many features that Clojure has to clean data. Initially,
we'll take a look at regular expressions and some other basic tools. Then, we'll move on to
how we can normalize specific kinds of values. The next few recipes will turn our attention
to the process of how to handle very large data sets. After that, we'll take a look at some more
sophisticated ways to fix data where we will write a simple spell checker and a custom parser.
Finally, the last recipe will introduce you to a Clojure library that has a good DSL to write tests
in order to validate your data.
Cleaning data with regular expressions
Often, cleaning data involves text transformations. Some, such as adding or removing a set
of static strings, are pretty simple. Others, such as parsing a complex data format such
as JSON or XML, require a complete parser. However, many fall within a middle range of
complexity. These need more processing power than simple string manipulation, but
full-fledged parsing is too much. For these tasks, regular expressions are often useful.
Regular expressions are probably the most basic and pervasive tool to clean data of any kind.
Although they're sometimes overused, they are often the best tool for the job.
Moreover, Clojure has a built-in syntax for compiled regular expressions, so they
are convenient too.
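To illustrate the built-in syntax, here is a short sketch using Clojure's `#"..."` regular expression literal together with `re-seq` and `clojure.string/replace`. The names `whitespace-run` and `collapse-whitespace` and the sample strings are hypothetical and are not part of the recipe that follows.

```clojure
(require '[clojure.string :as string])

;; #"..." is a reader literal that yields a compiled java.util.regex.Pattern.
(def whitespace-run #"\s+")

;; Hypothetical helper: trim the string and collapse runs of whitespace
;; into a single space.
(defn collapse-whitespace
  [s]
  (string/replace (string/trim s) whitespace-run " "))

;; re-seq returns a lazy sequence of every match in the input.
(def digit-groups (re-seq #"\d+" "Order 66, item 7"))

(def tidy (collapse-whitespace "  too   many    spaces "))
```

Because the literal is compiled when the code is read, there is no need to call `re-pattern` for a fixed expression.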
In this recipe, we'll write a function that normalizes U.S. phone numbers.
Getting ready
For this recipe, we will only require a very basic project.clj file. It should have these lines:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
 