1. First, we need a function that tokenizes a text into a sequence of lowercase words (this assumes that clojure.string has been required under the alias string):
(defn words [text]
  (re-seq #"[a-z]+" (string/lower-case text)))
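For example, tokenizing a short string at the REPL shows the case and punctuation being stripped away:
(words "Hello, World! Hello!")
;=> ("hello" "world" "hello")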
2. The training data structure is just a map from words to their frequencies:
(defn train [feats]
  (frequencies feats))
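For instance (the key order of the printed map may differ):
(train ["a" "rose" "is" "a" "rose"])
;=> {"a" 2, "rose" 2, "is" 1}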
3. Now we can train our spell checker. We'll use the dataset that Norvig links to in his article (http://norvig.com/big.txt), which I've downloaded locally:
(def n-words
  (train (words (slurp "data/big.txt"))))

(def alphabet
  "abcdefghijklmnopqrstuvwxyz")
4. We need to define some operations on the words in our training corpus:
;; Split a word into the prefix before index i and the suffix from i on.
(defn split-word [word i]
  [(.substring word 0 i) (.substring word i)])

;; Drop the first character of the suffix.
(defn delete-char [[w1 w2]]
  (str w1 (.substring w2 1)))

;; Swap the first two characters of the suffix.
(defn transpose-split [[w1 w2]]
  (str w1 (second w2) (first w2) (.substring w2 2)))

;; Replace the first character of the suffix with each letter in turn.
(defn replace-split [[w1 w2]]
  (let [w2-0 (.substring w2 1)]
    (map #(str w1 % w2-0) alphabet)))

;; Insert each letter between the prefix and the suffix.
(defn insert-split [[w1 w2]]
  (map #(str w1 % w2) alphabet))
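To see what these operators do, here is one split of the word "word" pushed through each of them at the REPL (replace-split and insert-split each return 26 strings, so only the first few are shown):
(split-word "word" 2)                ;=> ["wo" "rd"]
(delete-char ["wo" "rd"])            ;=> "wod"
(transpose-split ["wo" "rd"])        ;=> "wodr"
(take 3 (replace-split ["wo" "rd"])) ;=> ("woad" "wobd" "wocd")
(take 3 (insert-split ["wo" "rd"]))  ;=> ("woard" "wobrd" "wocrd")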
5. We're now ready to define the two functions that are the heart of the algorithm. The first function calculates all of the possible edits that can be made to a word, based on the operators we just defined:
(defn edits-1 [word]
  (let [splits (map (partial split-word word)
                    (range (inc (count word))))
        ;; Deletes and replaces consume one character of the
        ;; suffix, so they need splits with a non-empty suffix.
        non-empty-splits (filter #(seq (second %)) splits)
        ;; Transposes swap the suffix's first two characters, so
        ;; they need at least two characters in the suffix.
        long-splits (filter #(> (count (second %)) 1) splits)
        deletes (map delete-char non-empty-splits)
        transposes (map transpose-split long-splits)
        replaces (mapcat replace-split non-empty-splits)
        inserts (mapcat insert-split splits)]
    (set (concat deletes transposes replaces inserts))))
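A quick sanity check: a transposition typo should be exactly one edit away from its correction. For a word of length n, the four operators produce 54n + 25 candidate strings (n deletes, n - 1 transposes, 26n replaces, and 26(n + 1) inserts) before the set removes duplicates:
(contains? (edits-1 "wrod") "word")
;=> true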
 