1. First, we need a function that tokenizes a text into a sequence of lowercase words (this assumes that clojure.string has been required under the alias string):
(defn words [text]
  (re-seq #"[a-z]+" (string/lower-case text)))
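For example, tokenizing a short string at the REPL shows the case and punctuation being stripped away:
(words "Hello, World! Hello!")
;=> ("hello" "world" "hello")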
2. The training data structure is just a map from words to their frequencies:
(defn train [feats]
  (frequencies feats))
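For instance (the key order of the printed map may differ):
(train ["a" "rose" "is" "a" "rose"])
;=> {"a" 2, "rose" 2, "is" 1}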
3. Now we can train our spell checker. We'll use the dataset that Norvig links to in his article (http://norvig.com/big.txt), which I've downloaded locally:
(def n-words
  (train (words (slurp "data/big.txt"))))

(def alphabet
  "abcdefghijklmnopqrstuvwxyz")
4. We need to define some operations on the words in our training corpus:
;; Split a word into the prefix before index i and the suffix from i on.
(defn split-word [word i]
  [(.substring word 0 i) (.substring word i)])

;; Drop the first character of the suffix.
(defn delete-char [[w1 w2]]
  (str w1 (.substring w2 1)))

;; Swap the first two characters of the suffix.
(defn transpose-split [[w1 w2]]
  (str w1 (second w2) (first w2) (.substring w2 2)))

;; Replace the first character of the suffix with each letter in turn.
(defn replace-split [[w1 w2]]
  (let [w2-0 (.substring w2 1)]
    (map #(str w1 % w2-0) alphabet)))

;; Insert each letter between the prefix and the suffix.
(defn insert-split [[w1 w2]]
  (map #(str w1 % w2) alphabet))
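To see what these operators do, here is one split of the word "word" pushed through each of them at the REPL (replace-split and insert-split each return 26 strings, so only the first few are shown):
(split-word "word" 2)                ;=> ["wo" "rd"]
(delete-char ["wo" "rd"])            ;=> "wod"
(transpose-split ["wo" "rd"])        ;=> "wodr"
(take 3 (replace-split ["wo" "rd"])) ;=> ("woad" "wobd" "wocd")
(take 3 (insert-split ["wo" "rd"]))  ;=> ("woard" "wobrd" "wocrd")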
5. We're now ready to define the two functions that are the heart of the algorithm. The first function calculates all of the possible edits that can be made to a word, based on the operators we just defined:
(defn edits-1 [word]
  (let [splits (map (partial split-word word)
                    (range (inc (count word))))
        ;; Deletes and replaces consume one character of the
        ;; suffix, so they need splits with a non-empty suffix.
        non-empty-splits (filter #(seq (second %)) splits)
        ;; Transposes swap the suffix's first two characters, so
        ;; they need at least two characters in the suffix.
        long-splits (filter #(> (count (second %)) 1) splits)
        deletes (map delete-char non-empty-splits)
        transposes (map transpose-split long-splits)
        replaces (mapcat replace-split non-empty-splits)
        inserts (mapcat insert-split splits)]
    (set (concat deletes transposes replaces inserts))))
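A quick sanity check: a transposition typo should be exactly one edit away from its correction. For a word of length n, the four operators produce 54n + 25 candidate strings (n deletes, n - 1 transposes, 26n replaces, and 26(n + 1) inserts) before the set removes duplicates:
(contains? (edits-1 "wrod") "word")
;=> true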
 