We should remember that the purpose of creating a model or classifier is not to classify the training set, but to classify
data whose class we do not know. We want that data to be classified correctly, but often we have no way of knowing
whether the model does so. If the nature of the data changes over time, as it does when we are trying to detect spam
emails, then we need to measure performance over time as best we can. For example, in the case of spam emails, we can
track the rate at which users report spam emails that the classifier failed to flag as spam.
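As a rough illustration of tracking that rate, the Python sketch below groups delivered emails by time period and computes the fraction that users later reported as spam. The log format and the name missed_spam_rate are assumptions made for this example, not something the text specifies.

```python
from collections import defaultdict

def missed_spam_rate(log):
    """Return {period: fraction of inbox emails later reported as spam}.

    `log` is a hypothetical iterable of (period, was_reported) pairs, one
    pair per email that the classifier let through as "not spam".
    """
    delivered = defaultdict(int)
    reported = defaultdict(int)
    for period, was_reported in log:
        delivered[period] += 1
        if was_reported:
            reported[period] += 1
    # Every period in `delivered` has at least one email, so no division by zero.
    return {p: reported[p] / delivered[p] for p in sorted(delivered)}
```

A rate that rises from one period to the next would suggest that the nature of the spam is changing and that the model may need to be updated.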
Batch Versus On-Line Learning
Often, as in Examples 12.1 and 12.2, we use a batch learning architecture. That is, the
entire training set is available at the beginning of the process, and it is all used in whatever
way the algorithm requires to produce a model once and for all. The alternative is on-line
learning, where the training set arrives in a stream and, like any stream, cannot be revisited
after it is processed. In on-line learning, we maintain a model at all times. As new
training examples arrive, we may choose to modify the model to account for the new examples.
On-line learning has the advantages that it can:
(1) Deal with very large training sets, because it does not access more than one training
example at a time.
(2) Adapt to changes in the population of training examples as time goes on. For instance,
Google trains its spam-email classifier this way, adapting the classifier for spam as
new kinds of spam email are sent by spammers and indicated to be spam by the
recipients.
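The on-line setting can be sketched in a few lines of Python. The text does not prescribe a particular update rule, so the perceptron-style update used here is an assumption, as are the class name OnlineLinearClassifier and the encoding of labels as +1 and -1. The point is only that the model exists at all times and sees each training example exactly once.

```python
class OnlineLinearClassifier:
    """A linear model that is updated one training example at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.w = [0.0] * n_features      # the model we maintain at all times
        self.learning_rate = learning_rate

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        """Adjust the weights only if example (x, y) is currently misclassified."""
        if y * self.score(x) <= 0:
            self.w = [wi + self.learning_rate * y * xi
                      for wi, xi in zip(self.w, x)]


def train_on_stream(model, stream):
    """Consume (x, y) pairs once each; nothing from the stream is stored."""
    for x, y in stream:
        model.update(x, y)
    return model
```

Because train_on_stream accepts any iterator, the examples could come from a generator that reads a file or a network feed, so memory use does not grow with the size of the training set.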
An enhancement of on-line learning, suitable in some cases, is active learning. Here, the
classifier may receive some training examples, but it primarily receives unclassified data,
which it must classify. If the classifier is unsure of the classification (e.g., the newly
arrived example is very close to the boundary), then the classifier can ask for ground truth
at some significant cost. For instance, it could send the example to Mechanical Turk and
gather opinions of real people. In this way, examples near the boundary become training
examples and can be used to modify the classifier.
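One way to picture a single active-learning step is the sketch below. It assumes a linear classifier with weight vector w, treats "close to the boundary" as a score whose absolute value is below a chosen margin, and uses a hypothetical callback ask_oracle to stand in for the costly source of ground truth (for example, human labelers). The margin value and the perceptron-style update are illustrative choices, not taken from the text.

```python
def active_learning_step(w, x, ask_oracle, margin=0.5, learning_rate=1.0):
    """Classify x; query the oracle only when x is near the boundary.

    Returns (weights, label, queried), where `queried` says whether the
    costly ground-truth lookup was made.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    if abs(score) >= margin:
        # Confident enough: classify without paying for a label.
        return w, (1 if score > 0 else -1), False

    # Unsure: pay for the true label, which turns x into a training example.
    y = ask_oracle(x)
    if y * score <= 0:
        w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
    return w, y, True
```

Raising margin makes the classifier ask for ground truth more often, trading labeling cost for faster correction near the boundary.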
Feature Selection
Sometimes, the hardest part of designing a good model or classifier is figuring out what
features to use as input to the learning algorithm. Let us reconsider Example 12.3, where
we suggested that we could classify emails as spam or not by looking at the words
contained in the email. In fact, we explore such a classifier in detail in Example 12.4. As
discussed in Example 12.3, it may make sense to focus on certain words and not others; e.g.,
we should eliminate stop words.
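As a small illustration of eliminating stop words, the sketch below turns an email into the set of words it contains, minus a stop-word list. The list STOP_WORDS is a tiny illustrative sample rather than a standard list, and the function name email_features and the tokenizing rule (lowercase runs of letters and apostrophes) are likewise assumptions.

```python
import re

# A deliberately tiny, illustrative stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "you"})

def email_features(text, stop_words=STOP_WORDS):
    """Return the set of distinct non-stop words appearing in the email text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in stop_words}
```

For instance, email_features("You are the WINNER of a free prize") keeps "are", "winner", "free", and "prize" and drops "you", "the", "of", and "a".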