How it works…
The only wrinkle here is that we have to normalize the input a little by making sure that it's
uppercased before we can apply the mapping of synonyms to it. Otherwise, we'd also need
to have an entry for any possible way in which the input can be capitalized.
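For example, a minimal sketch of that normalization might look like the following (the synonym map and its entries here are hypothetical, just to show the upper-casing step):

(require '[clojure.string :as str])

;; A hypothetical synonym map whose keys are all upper-case.
(def state-synonyms
  {"CALIF" "CALIFORNIA"
   "CA"    "CALIFORNIA"
   "NY"    "NEW YORK"})

;; Upper-case the input before the lookup, so "ca", "Ca", and "CA"
;; all hit the same entry. Unknown values pass through unchanged.
(defn normalize [input]
  (let [upper (str/upper-case input)]
    (get state-synonyms upper upper)))

(normalize "ca")    ;; => "CALIFORNIA"
(normalize "Calif") ;; => "CALIFORNIA"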
See also
The Fixing spelling errors recipe later in this chapter
Identifying and removing duplicate data
One problem when cleaning up data is dealing with duplicates. How do we find them? What
do we do with them once we have them? While part of this process can be automated, merging
duplicated data is often a manual task, because a person has to look at potential matches,
determine whether they really are duplicates, and decide what needs to be done with the
overlapping data. We can code heuristics, of course, but at some point, a person needs to
make the final call.
The first question that needs to be answered is what constitutes identity for the data. If you
have two items of data, which fields do you have to look at in order to determine whether
they are duplicates? Then, you must determine how close they need to be.
For this recipe, we'll examine some data and decide on duplicates by doing a fuzzy
comparison of the name fields. We'll simply return all of the pairs that appear to be duplicates.
Getting ready
First, we need to add the library for fuzzy string matching to our Leiningen
project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-diff "1.0.0-SNAPSHOT"]])
And to make sure that's available to our script or REPL:
(use 'clj-diff.core)
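As a rough sketch of where this is heading (a minimal sketch, not the recipe's final code; it assumes clj-diff's edit-distance function, and the :name field and the 0.25 threshold are illustrative choices), we could flag possible duplicate pairs like this:

(require '[clojure.string :as str])

(defn fuzzy= [a b]
  ;; Treat two strings as matching when their edit distance is small
  ;; relative to their length; 0.25 is an arbitrary threshold.
  (let [a (str/lower-case a)
        b (str/lower-case b)]
    (<= (edit-distance a b)
        (* 0.25 (max (count a) (count b))))))

(defn possible-duplicates [records]
  ;; Return every pair of records whose :name fields fuzzily match.
  (for [[r1 & more] (iterate rest records)
        :while (seq more)
        r2 more
        :when (fuzzy= (:name r1) (:name r2))]
    [r1 r2]))

(possible-duplicates [{:name "John Smith"}
                      {:name "Jon Smith"}
                      {:name "Jane Doe"}])
;; => ([{:name "John Smith"} {:name "Jon Smith"}])

Here, edit-distance comes from the (use 'clj-diff.core) call above.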