Cleaning and Validating Data - Clojure Data Analysis

This is often an iterative, interactive process. If it's a very large dataset, I might create a sample to work with at this stage. Generally, I start by examining the data files. Once I find a problem, I try to code a solution, which I run on the dataset. After each change, I archive the data, either using a ZIP file or, if the data files are small enough, a version control system. Using a version control system is a good option because I can track the code to transform the data along with the data itself and I can also include comments about what I'm doing. Then, I take a look at the data again, and the entire process starts again. Once I've moved on to analyze the entire collection of data, I might find more issues or I might need to change the data somehow in order to make it easier to analyze, and I'm back in the data cleansing loop once more.

Clojure is an excellent tool for this kind of work, because a REPL is a great environment to explore data and fix it interactively. Also, because many of its sequence functions are lazy by default, Clojure makes it easy to work with a lot of data.
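
To illustrate the point about laziness, here is a minimal sketch. The names `clean-value`, `raw-values`, and `sample` are hypothetical and exist only for this illustration; the infinite sequence stands in for an input too large to hold in memory.

```clojure
(require '[clojure.string :as str])

;; Hypothetical cleaning step: trim whitespace and lower-case a raw value.
(defn clean-value [s]
  (str/lower-case (str/trim s)))

;; An infinite lazy sequence standing in for a very large input.
(def raw-values
  (map #(str "  Row-" % "  ") (range)))

;; `map` is lazy, so only a small prefix of the input is ever realized,
;; driven by what `take` asks for. `doall` forces just those three items.
(def sample
  (doall (take 3 (map clean-value raw-values))))
;; sample => ("row-0" "row-1" "row-2")
```

Because nothing is computed until it is consumed, the same pipeline can be tried on a handful of rows at the REPL and later run over the full dataset without changes.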

This chapter will highlight a few of the many features that Clojure has to clean data. Initially, we'll take a look at regular expressions and some other basic tools. Then, we'll move on to how we can normalize specific kinds of values. The next few recipes will turn our attention to the process of how to handle very large data sets. After that, we'll take a look at some more sophisticated ways to fix data, where we will write a simple spell checker and a custom parser. Finally, the last recipe will introduce you to a Clojure library that has a good DSL to write tests in order to validate your data.

Cleaning data with regular expressions

Often, cleaning data involves text transformations. Some, such as adding or removing a set of static strings, are pretty simple. Others, such as parsing a complex data format such as JSON or XML, require a complete parser. However, many fall within a middle range of complexity. These need more processing power than simple string manipulation, but full-fledged parsing is too much. For these tasks, regular expressions are often useful.

A regular expression is probably the most basic and pervasive tool for cleaning data of any kind. Although they're sometimes overused, regular expressions are often the best tool for the job. Moreover, Clojure has a built-in syntax for compiled regular expressions, so they are convenient too.
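
As a quick illustration of that built-in syntax, the sketch below uses the `#"..."` reader literal with a few core functions. The helper `digits-only` and the sample strings are hypothetical and are not the recipe's own code.

```clojure
(require '[clojure.string :as str])

;; #"..." is a reader literal that yields a compiled java.util.regex.Pattern.
(def non-digits #"\D+")

;; Hypothetical helper: keep only the digits of a raw string.
(defn digits-only [s]
  (str/replace s non-digits ""))

(digits-only "(555) 123-4567")
;; => "5551234567"

;; re-find returns the whole match followed by its capture groups.
(re-find #"(\d{3})-(\d{4})" "call 123-4567 today")
;; => ["123-4567" "123" "4567"]

;; re-seq returns a lazy sequence of all matches.
(re-seq #"\d+" "10 apples, 20 pears")
;; => ("10" "20")
```

Since the pattern is compiled when the code is read, there is no separate compile step and no need to double-escape backslashes as in a Java string.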

In this example, we'll write a function that normalizes U.S. phone numbers.

Getting ready

For this recipe, we only need a very basic project.clj file. It should have these lines:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
