For data, we can get preclassified emails from the SpamAssassin website. Take a look at https://spamassassin.apache.org/publiccorpus/. From this directory, I downloaded 20050311_spam_2.tar.bz2, 20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2. I decompressed these into the training directory, which added three subdirectories: training/easy_ham_2, training/hard_ham, and training/spam_2.
I also downloaded two other archives, 20021010_hard_ham.tar.bz2 and 20021010_spam.tar.bz2, and decompressed these into the test-data directory in order to create the test-data/hard_ham and test-data/spam directories.
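The download and extraction steps above can be sketched as a short shell session. The archive names and target directories come from the text; the `wget`/`tar` invocations and the assumption that each archive unpacks into a correspondingly named top-level directory are mine:

```shell
# Create the directories used for training and test data.
mkdir -p training test-data

base=https://spamassassin.apache.org/publiccorpus

# Fetch and unpack the training corpora (assumed layout:
# each archive contains one top-level directory).
for f in 20050311_spam_2 20030228_easy_ham_2 20030228_hard_ham; do
  wget "$base/$f.tar.bz2"
  tar -xjf "$f.tar.bz2" -C training
done

# Fetch and unpack the test corpora.
for f in 20021010_hard_ham 20021010_spam; do
  wget "$base/$f.tar.bz2"
  tar -xjf "$f.tar.bz2" -C test-data
done
```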
How to do it…
Now, we can define the functions to create the processing pipeline and a list of document
instances, as well as to train the classifier and classify the documents:
1. We'll create the processing pipeline separately. A single instance of this has to be used to process all of the training, test, and actual data. Hang on to this:
(defn make-pipe-list []
  (SerialPipes.
    [;; Turn each instance's target (its directory name)
     ;; into a label.
     (Target2Label.)
     (SaveDataInSource.)
     ;; Read each file's contents as a UTF-8 string.
     (Input2CharSequence. "UTF-8")
     ;; Tokenize into words of three or more characters,
     ;; allowing internal punctuation.
     (CharSequence2TokenSequence.
       #"\p{L}[\p{L}\p{P}]+\p{L}")
     (TokenSequenceLowercase.)
     (TokenSequenceRemoveStopwords.)
     ;; Map tokens to feature indexes and build the
     ;; feature vectors the classifier trains on.
     (TokenSequence2FeatureSequence.)
     (FeatureSequence2AugmentableFeatureVector.
       false)]))
2. We can use that to create the instance list over the files in a directory. When we do, we'll use each document's parent directory's name as its classification. This is what we'll be training the classifier on:
(defn add-input-directory [dir-name pipe]
  (doto (InstanceList. pipe)
    (.addThruPipe
      ;; The regex's capture group extracts the parent
      ;; directory name, which becomes the target label.
      (FileIterator. (io/file dir-name)
                     #".*/([^/]*?)/\d+\..*$"))))
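Since the FileIterator walks a directory tree and labels each file by its parent directory, the two functions can be wired together by sharing one pipeline across the training and test data. This is a sketch of the expected usage, not necessarily the book's next step:

```clojure
;; One pipeline instance, reused for every data set so that
;; their feature alphabets stay compatible.
(def pipe (make-pipe-list))

;; Walks training/ recursively; each message is labeled with
;; its parent directory (spam_2, easy_ham_2, or hard_ham).
(def training-instances (add-input-directory "training" pipe))

;; The same pipe processes the test data.
(def test-instances (add-input-directory "test-data" pipe))
```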