How it works…
The only wrinkle here is that we have to normalize the input a little by making sure that it's
uppercased before we can apply the mapping of synonyms to it. Otherwise, we'd also need
to have an entry for any possible way in which the input can be capitalized.
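For example, a minimal sketch of that normalization might look like the following (the synonym map and its entries here are hypothetical, just to show the upper-casing step):

(require '[clojure.string :as str])

;; A hypothetical synonym map whose keys are all upper-case.
(def state-synonyms
  {"CALIF" "CALIFORNIA"
   "CA"    "CALIFORNIA"
   "NY"    "NEW YORK"})

;; Upper-case the input before the lookup, so "ca", "Ca", and "CA"
;; all hit the same entry. Unknown values pass through unchanged.
(defn normalize [input]
  (let [upper (str/upper-case input)]
    (get state-synonyms upper upper)))

(normalize "ca")    ;; => "CALIFORNIA"
(normalize "Calif") ;; => "CALIFORNIA"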
See also
The Fixing spelling errors recipe later in this chapter
Identifying and removing duplicate data
One problem when cleaning up data is dealing with duplicates. How do we find them? What
do we do with them once we have them? While part of this process can be automated, merging
duplicated data is often a manual task, because a person has to look at potential matches,
determine whether they really are duplicates, and decide what needs to be done with the
overlapping data. We can code heuristics, of course, but at some point, a person needs to
make the final call.
The first question that needs to be answered is what constitutes identity for the data. If you
have two items of data, which fields do you have to look at in order to determine whether
they are duplicates? Then, you must determine how close they need to be.
For this recipe, we'll examine some data and decide on duplicates by doing a fuzzy
comparison of the name fields. We'll simply return all of the pairs that appear to be duplicates.
Getting ready
First, we need to add the library for fuzzy string matching to our Leiningen
project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-diff "1.0.0-SNAPSHOT"]])
And to make sure that's available to our script or REPL:
(use 'clj-diff.core)
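As a rough sketch of where this is heading (a minimal sketch, not the recipe's final code; it assumes clj-diff's edit-distance function, and the :name field and the 0.25 threshold are illustrative choices), we could flag possible duplicate pairs like this:

(require '[clojure.string :as str])

(defn fuzzy= [a b]
  ;; Treat two strings as matching when their edit distance is small
  ;; relative to their length; 0.25 is an arbitrary threshold.
  (let [a (str/lower-case a)
        b (str/lower-case b)]
    (<= (edit-distance a b)
        (* 0.25 (max (count a) (count b))))))

(defn possible-duplicates [records]
  ;; Return every pair of records whose :name fields fuzzily match.
  (for [[r1 & more] (iterate rest records)
        :while (seq more)
        r2 more
        :when (fuzzy= (:name r1) (:name r2))]
    [r1 r2]))

(possible-duplicates [{:name "John Smith"}
                      {:name "Jon Smith"}
                      {:name "Jane Doe"}])
;; => ([{:name "John Smith"} {:name "Jon Smith"}])

Here, edit-distance comes from the (use 'clj-diff.core) call above.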