For data, we can get preclassified emails from the SpamAssassin website. Take a look at https://spamassassin.apache.org/publiccorpus/. From this directory, I downloaded 20050311_spam_2.tar.bz2, 20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2. I decompressed these into the training directory, which added three subdirectories: training/easy_ham_2, training/hard_ham, and training/spam_2.
I also downloaded two other archives, 20021010_hard_ham.tar.bz2 and 20021010_spam.tar.bz2, and decompressed these into the test-data directory in order to create the test-data/hard_ham and test-data/spam directories.
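The download and extraction steps above can be sketched as a short shell session. The archive names and target directories come from the text; the `wget`/`tar` invocations and the assumption that each archive unpacks into a correspondingly named top-level directory are mine:

```shell
# Create the directories used for training and test data.
mkdir -p training test-data

base=https://spamassassin.apache.org/publiccorpus

# Fetch and unpack the training corpora (assumed layout:
# each archive contains one top-level directory).
for f in 20050311_spam_2 20030228_easy_ham_2 20030228_hard_ham; do
  wget "$base/$f.tar.bz2"
  tar -xjf "$f.tar.bz2" -C training
done

# Fetch and unpack the test corpora.
for f in 20021010_hard_ham 20021010_spam; do
  wget "$base/$f.tar.bz2"
  tar -xjf "$f.tar.bz2" -C test-data
done
```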
How to do it…
Now, we can define the functions to create the processing pipeline and a list of document
instances, as well as to train the classifier and classify the documents:
1. We'll create the processing pipeline separately. A single instance of this has to be used to process all of the training, test, and actual data. Hang on to this:
(defn make-pipe-list []
  (SerialPipes.
    [;; Turn each instance's target (its directory name)
     ;; into a label.
     (Target2Label.)
     (SaveDataInSource.)
     ;; Read each file's contents as a UTF-8 string.
     (Input2CharSequence. "UTF-8")
     ;; Tokenize into words of three or more characters,
     ;; allowing internal punctuation.
     (CharSequence2TokenSequence.
       #"\p{L}[\p{L}\p{P}]+\p{L}")
     (TokenSequenceLowercase.)
     (TokenSequenceRemoveStopwords.)
     ;; Map tokens to feature indexes and build the
     ;; feature vectors the classifier trains on.
     (TokenSequence2FeatureSequence.)
     (FeatureSequence2AugmentableFeatureVector.
       false)]))
2. We can use that to create the instance list over the files in a directory. When we do, we'll use each document's parent directory's name as its classification. This is what we'll be training the classifier on:
(defn add-input-directory [dir-name pipe]
  (doto (InstanceList. pipe)
    (.addThruPipe
      ;; The regex's capture group extracts the parent
      ;; directory name, which becomes the target label.
      (FileIterator. (io/file dir-name)
                     #".*/([^/]*?)/\d+\..*$"))))
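Since the FileIterator walks a directory tree and labels each file by its parent directory, the two functions can be wired together by sharing one pipeline across the training and test data. This is a sketch of the expected usage, not necessarily the book's next step:

```clojure
;; One pipeline instance, reused for every data set so that
;; their feature alphabets stay compatible.
(def pipe (make-pipe-list))

;; Walks training/ recursively; each message is labeled with
;; its parent directory (spam_2, easy_ham_2, or hard_ham).
(def training-instances (add-input-directory "training" pipe))

;; The same pipe processes the test data.
(def test-instances (add-input-directory "test-data" pipe))
```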