How to do it…
To create vectors of all the documents, we'll first need to create a token index that
maps tokens to the indexes in the vector. We'll then use that to create a sequence of Colt
vectors. Finally, we can load the SOTU addresses and generate sparse feature vectors of all
the documents, as follows:
1. Before we can create the feature vectors, we need to have a token index so that the vector indexes will be consistent across all of the documents. The build-index function takes care of this:
(defn build-index [corpus]
  ;; Pair each distinct term in the corpus with a sequential integer,
  ;; which becomes that term's column in the feature vectors.
  (zipmap (tfidf/get-corpus-terms corpus)
          (range)))
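Note that build-index relies on tfidf/get-corpus-terms from an earlier recipe. As a hypothetical sketch, assume it returns the distinct tokens appearing anywhere in the corpus; the get-corpus-terms stand-in and the toy-corpus and toy-index names below are ours, not the book's:
;; Hypothetical sketch of what tfidf/get-corpus-terms is assumed to
;; do: collect the distinct tokens across the corpus, where the
;; corpus is a sequence of token-frequency maps (one per document).
(defn get-corpus-terms [corpus]
  (into #{} (mapcat keys corpus)))

;; Indexing a toy corpus of two documents:
(def toy-corpus [{"cat" 2, "dog" 1} {"dog" 3, "fish" 1}])
(def toy-index (build-index toy-corpus))
;; => something like {"cat" 0, "dog" 1, "fish" 2}
;;    (the exact numbering depends on set ordering)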
2. Now, we can use build-index to convert a sequence of token-frequency pairs into a feature vector. All of the tokens must be in the index:
;; DoubleFactory2D comes from the Colt library:
;; (import 'cern.colt.matrix.DoubleFactory2D)
(defn ->matrix [index pairs]
  (let [matrix (.make DoubleFactory2D/sparse
                      1 (count index) 0.0)
        ;; inc-cell writes frequency v into the column for token k.
        ;; .set returns nil, so we return the matrix m explicitly.
        inc-cell (fn [m p]
                   (let [[k v] p
                         i (index k)]
                     (.set m 0 i v)
                     m))]
    (reduce inc-cell matrix pairs)))
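Continuing the hypothetical toy example from step 1, we can vectorize one document and read cells back with Colt's .get method:
;; Reducing over a map yields [token frequency] entries, so a
;; frequency map can be passed directly as the pairs argument.
(def v (->matrix toy-index (first toy-corpus)))

(.columns v)                  ;; => 3, one column per indexed token
(.get v 0 (toy-index "cat"))  ;; => 2.0
(.get v 0 (toy-index "fish")) ;; => 0.0; absent tokens stay zero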
With these in place, let's put them to use by loading the token frequencies for a corpus and then creating the index from them:
;; Read each file in the sotu directory, tokenize and normalize it,
;; and keep a token-frequency map per document.
(def corpus
  (->> "sotu"
       (java.io.File.)
       (.list)
       (map #(str "sotu/" %))
       (map slurp)
       (map tokenize)
       (map normalize)
       (map frequencies)))
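The tokenize and normalize functions are defined in earlier recipes. If you don't have them handy, minimal hypothetical stand-ins might look like this; the book's versions may differ in detail:
(require '[clojure.string :as str])

;; Hypothetical stand-ins for the tokenize and normalize functions
;; from earlier recipes.
(defn tokenize [text]
  (re-seq #"\w+" text))           ;; split on runs of word characters

(defn normalize [tokens]
  (map str/lower-case tokens))    ;; case-fold the tokens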
(def index (build-index corpus))
With the index, we can finally move the documents' token frequencies into sparse vectors:
(def vecs (map #(->matrix index %) corpus))
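As a quick sanity check, you can inspect the result through Colt's DoubleMatrix2D API; columns gives the vector length and cardinality counts the non-zero cells:
(count vecs)                ;; => the number of SOTU documents
(.columns (first vecs))     ;; => the vocabulary size of the index
(.cardinality (first vecs)) ;; => distinct tokens in the first document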
 