Getting ready
For this, we'll need our usual dependencies:
(defproject statim "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])
We'll also require those in our script or REPL:
(require '[incanter.core :as i]
         '[incanter.stats :as s]
         '[incanter.charts :as c]
         '[clojure.string :as str])
For this recipe, we'll look at Sir Arthur Conan Doyle's Sherlock Holmes stories. You can
download the text from Project Gutenberg at http://www.gutenberg.org/cache/epub/1661/pg1661.txt
or from http://www.ericrochester.com/clj-data-analysis/data/pg1661.txt.
How to do it…
We'll look at the distribution of the word baker over the course of the book. This may give
some indication of how important Holmes' residence at 221B Baker Street is for a given story.
1. First, we'll define a function that takes a text string and pulls the words out of it,
or tokenizes it:
(defn tokenize
  [text]
  (map str/lower-case (re-seq #"\w+" text)))
2. Next, we'll write a function that takes an item and a collection and returns how
many times the item appears in the collection:
(defn count-hits
  [x coll]
  (get (frequencies coll) x 0))
3. Now we can read the file, tokenize it, and break it into overlapping windows of
500 tokens, each starting 250 tokens after the previous one:
(def data-file "data/pg1661.txt")
(def windows
  (partition 500 250 (tokenize (slurp data-file))))
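To see how these pieces fit together before running them on the full text, here is a minimal sketch on a tiny input. The sample string and the names sample, sample-tokens, sample-windows, and sample-hits are hypothetical and not part of the recipe, and the window size is shrunk from 500/250 to 4/2 so that the result is easy to read:

```clojure
(require '[clojure.string :as str])

;; Same definitions as in steps 1 and 2 above.
(defn tokenize
  [text]
  (map str/lower-case (re-seq #"\w+" text)))

(defn count-hits
  [x coll]
  (get (frequencies coll) x 0))

;; Hypothetical sample text, standing in for the contents of the data file.
(def sample "Baker Street is in London. Holmes and Watson left Baker Street.")

;; Lower-cased word tokens; punctuation is discarded by the regex.
(def sample-tokens (tokenize sample))

;; Windows of 4 tokens, each starting 2 tokens after the previous one.
;; A trailing window with fewer than 4 tokens is dropped by partition.
(def sample-windows (partition 4 2 sample-tokens))

;; Number of times "baker" occurs in each window.
(def sample-hits (map #(count-hits "baker" %) sample-windows))
;; => (1 0 0 1)
```

Because partition discards an incomplete final window, the last few hundred tokens of the real file may not appear in a window of their own, although most of them are still covered by the preceding overlapping window.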
 