When it's an open domain, such as words in a free-text field, the problem can be quite difficult.
However, when the data represents a limited vocabulary (such as US state names, for our
example here) there's a simple trick that can help. While it's common to use full state names,
standard postal codes are also often used. A mapping from common forms or mistakes to a
normalized form is an easy way to fix variants in a field.
Getting ready
For the project.clj file, we'll use a very simple configuration:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
We just need to make sure that the clojure.string/upper-case function is available
to us:
(use '[clojure.string :only (upper-case)])
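As an alternative sketch, the same function can be pulled in with require and :refer, which brings in only the named function. This is a stylistic assumption on my part rather than something the recipe itself prescribes; either form makes upper-case available:

```clojure
;; Alternative to `use`: refer only the one function we need.
(require '[clojure.string :refer [upper-case]])

(upper-case "Fla")
;; => "FLA"
```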
How to do it…
1. For this recipe, we'll define the synonym map and a function to use it. Then, we'll
see it in action. We'll define the mapping to a normalized form. I will not list all of
the states here, but you should get the idea:
(def state-synonyms
  {"ALABAMA" "AL",
   "ALASKA" "AK",
   "ARIZONA" "AZ",
   ;; ... remaining states omitted ...
   "WISCONSIN" "WI",
   "WYOMING" "WY"})
2. We'll wrap it in a function that makes the input uppercased before querying the
mapping, as shown here:
(defn normalize-state [state]
  (let [uc-state (upper-case state)]
    (state-synonyms uc-state uc-state)))
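3. Then, we just call normalize-state with the strings we want to fix. The output shown
assumes the full map, including a "FLA" entry for Florida; with only the abbreviated map
listed in step 1, "Fla" would come back as "FLA":
user=> (map normalize-state
            ["Alabama" "OR" "Va" "Fla"])
("AL" "OR" "VA" "FL")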
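The steps above can be sketched as one self-contained snippet. It is a minimal illustration, not the recipe's full code: the names extended-synonyms and normalize-state-trimmed are hypothetical, the "FLORIDA" and "FLA" entries are assumed for the example, and the whitespace trimming is an extra step the recipe does not include.

```clojure
(require '[clojure.string :as str])

;; Hypothetical map for illustration only. The "FLORIDA" and "FLA" entries
;; are assumptions; the recipe does not list its complete map.
(def extended-synonyms
  {"ALABAMA" "AL"
   "FLORIDA" "FL"
   "FLA"     "FL"})

(defn normalize-state-trimmed
  "Trims and uppercases `state`, then looks it up in `extended-synonyms`.
  Falls back to the uppercased input when no synonym is known."
  [state]
  (let [uc-state (str/upper-case (str/trim state))]
    ;; `get` with a third argument returns that argument when the key is
    ;; missing, the same behaviour as calling the map with two arguments.
    (get extended-synonyms uc-state uc-state)))

(map normalize-state-trimmed ["Alabama" " fla " "OR" "Va"])
;; => ("AL" "FL" "OR" "VA")
```

Because unknown inputs fall through as their uppercased form, values that are already postal codes ("OR") or merely mis-cased ("Va") need no entry in the map; only genuine variants and misspellings have to be listed.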
 