For data, we can get preclassified emails from the SpamAssassin website. Take a look
at https://spamassassin.apache.org/publiccorpus/. From this directory,
I downloaded 20050311_spam_2.tar.bz2, 20030228_easy_ham_2.tar.bz2,
and 20030228_hard_ham.tar.bz2. I decompressed these into the training directory.
This added three subdirectories: training/easy_ham_2, training/hard_ham,
and training/spam_2.
I also downloaded two other archives: 20021010_hard_ham.tar.bz2 and
20021010_spam.tar.bz2. I decompressed these into the test-data directory in order
to create the test-data/hard_ham and test-data/spam directories.
How to do it…
Now, we can define the functions to create the processing pipeline and a list of document
instances, as well as to train the classifier and classify the documents:
1. We'll create the processing pipeline separately. A single instance of this has to be
used to process all of the training, test, and actual data. Hang on to this:
(defn make-pipe-list []
  (SerialPipes.
   [(Target2Label.)                       ; directory name -> class label
    (SaveDataInSource.)                   ; remember each instance's source file
    (Input2CharSequence. "UTF-8")         ; read the file's contents as UTF-8 text
    (CharSequence2TokenSequence.
     #"\p{L}[\p{L}\p{P}]+\p{L}")          ; tokenize on word-like character runs
    (TokenSequenceLowercase.)             ; normalize case
    (TokenSequenceRemoveStopwords.)       ; drop common English stopwords
    (TokenSequence2FeatureSequence.)      ; map tokens to feature indices
    (FeatureSequence2AugmentableFeatureVector.
     false)]))                            ; convert to a feature vector
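These snippets assume the MALLET classes are imported into the namespace. A minimal sketch of the `ns` declaration they rely on (the namespace name `spam.classify` is hypothetical; the class and package names are MALLET's):

```clojure
(ns spam.classify
  (:require [clojure.java.io :as io])
  (:import [cc.mallet.pipe SerialPipes Target2Label SaveDataInSource
            Input2CharSequence CharSequence2TokenSequence
            TokenSequenceLowercase TokenSequenceRemoveStopwords
            TokenSequence2FeatureSequence
            FeatureSequence2AugmentableFeatureVector]
           [cc.mallet.pipe.iterator FileIterator]
           [cc.mallet.types InstanceList]))
```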
2. We can use that to create the instance list over the files in a directory. When we do,
we'll use each document's parent directory name as its classification. This is what
we'll be training the classifier on:
(defn add-input-directory [dir-name pipe]
  (doto (InstanceList. pipe)
    (.addThruPipe
     (FileIterator. (io/file dir-name)
                    ;; the capture group pulls the parent directory's
                    ;; name out of each file path to use as the label
                    #".*/([^/]*?)/\d+\..*$"))))
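A sketch of how these two functions might fit together, reusing the single pipeline instance for every directory. The training step here assumes MALLET's `NaiveBayesTrainer` (from `cc.mallet.classify`); treat this as an outline under those assumptions, not the recipe's exact code:

```clojure
;; One pipeline instance, shared by all data sets.
(def pipe (make-pipe-list))

;; Each call returns an InstanceList whose labels come from the parent
;; directory names (here spam_2 and easy_ham_2).
(def spam-instances (add-input-directory "training/spam_2" pipe))
(def ham-instances  (add-input-directory "training/easy_ham_2" pipe))

;; InstanceList extends ArrayList, so lists can be merged with .addAll
;; before handing them to a trainer.
(def all-training (doto spam-instances (.addAll ham-instances)))
(def classifier (.train (NaiveBayesTrainer.) all-training))
```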