Large-Scale Machine Learning - Mining of Massive Datasets

We should remember that the purpose of creating a model or classifier is not to classify the training set, but to classify the data whose class we do not know. We want that data to be classified correctly, but often we have no way of knowing whether or not the model does so. If the nature of the data changes over time (for instance, if we are trying to detect spam emails), then we need to measure the performance over time as best we can. For example, in the case of spam emails, we can note the rate of reports of spam emails that were not classified as spam.
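
One way to track such a rate is sketched below. This is only an illustration under stated assumptions: the class name MissedSpamMonitor, the sliding-window design, and the window size are all hypothetical and do not come from the text, which says only that the rate of reports can be noted.

```python
from collections import deque


class MissedSpamMonitor:
    """Hypothetical helper: tracks the fraction of recently delivered emails
    that recipients reported as spam the classifier had missed."""

    def __init__(self, window: int = 1000):
        # Keep only the most recent `window` outcomes; older ones fall off.
        self.events = deque(maxlen=window)

    def record(self, reported_as_missed_spam: bool) -> None:
        # 1 = a recipient reported this delivered email as spam, 0 = no report.
        self.events.append(1 if reported_as_missed_spam else 0)

    def rate(self) -> float:
        # Fraction of reports within the current window (0.0 if empty).
        return sum(self.events) / len(self.events) if self.events else 0.0
```

A rising value of rate() over successive windows would suggest that the nature of the spam has drifted away from what the model was trained on.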

Batch Versus On-Line Learning

Often, as in Examples 12.1 and 12.2, we use a batch learning architecture. That is, the entire training set is available at the beginning of the process, and it is all used in whatever way the algorithm requires to produce a model once and for all. The alternative is on-line learning, where the training set arrives in a stream and, like any stream, cannot be revisited after it is processed. In on-line learning, we maintain a model at all times. As new training examples arrive, we may choose to modify the model to account for the new examples.

On-line learning has the advantages that it can

(1) Deal with very large training sets, because it does not access more than one training example at a time.

(2) Adapt to changes in the population of training examples as time goes on. For instance, Google trains its spam-email classifier this way, adapting the classifier for spam as new kinds of spam email are sent by spammers and indicated to be spam by the recipients.
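
The on-line setting can be sketched as a loop that sees each training example once and keeps only the current model. The sketch below assumes a perceptron-style update for a linear classifier with labels +1 and -1; that choice, the function name online_perceptron, and the learning rate eta are illustrative assumptions, since the text here does not prescribe a particular algorithm.

```python
from typing import Iterable, List, Sequence, Tuple


def online_perceptron(
    stream: Iterable[Tuple[Sequence[float], int]],
    dim: int,
    eta: float = 1.0,
) -> List[float]:
    """Consume (x, y) pairs one at a time, with y in {+1, -1}.

    Only the weight vector is kept in memory; no example is revisited.
    The last entry of the returned list is the bias term.
    """
    w = [0.0] * (dim + 1)
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
        if y * score <= 0:  # misclassified, or exactly on the boundary
            for i, xi in enumerate(x):
                w[i] += eta * y * xi
            w[-1] += eta * y
    return w
```

Because the loop accepts any iterable, the stream can be a generator reading from a file or a network source, so memory use does not grow with the size of the training set.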

An enhancement of on-line learning, suitable in some cases, is active learning. Here, the classifier may receive some training examples, but it primarily receives unclassified data, which it must classify. If the classifier is unsure of the classification (e.g., the newly arrived example is very close to the boundary), then the classifier can ask for ground truth at some significant cost. For instance, it could send the example to Mechanical Turk and gather opinions of real people. In this way, examples near the boundary become training examples and can be used to modify the classifier.
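
A minimal sketch of this idea follows, again assuming a linear classifier with a perceptron-style update. The function classify_or_query, the margin threshold, and the ask_oracle callback (standing in for a costly source of ground truth such as human labelers) are hypothetical names introduced for illustration only.

```python
from typing import Callable, List, Sequence


def classify_or_query(
    w: List[float],
    x: Sequence[float],
    ask_oracle: Callable[[Sequence[float]], int],
    margin: float = 0.5,
    eta: float = 1.0,
) -> int:
    """Return a label in {+1, -1} for x.

    The (assumed costly) oracle is consulted only when the score is within
    `margin` of the decision boundary. The last entry of w is the bias.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    if abs(score) >= margin:
        return 1 if score > 0 else -1  # confident: no query needed

    y = ask_oracle(x)  # unsure: pay for ground truth
    if y * score <= 0:  # the model disagreed with the oracle, so update in place
        for i, xi in enumerate(x):
            w[i] += eta * y * xi
        w[-1] += eta * y
    return y
```

The margin parameter controls the trade-off: a larger value sends more examples to the oracle, which costs more but gives the classifier more labeled examples near the boundary.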

Feature Selection

Sometimes, the hardest part of designing a good model or classifier is figuring out what features to use as input to the learning algorithm. Let us reconsider Example 12.3, where we suggested that we could classify emails as spam or not by looking at the words contained in the email. In fact, we explore such a classifier in detail in Example 12.4. As discussed in Example 12.3, it may make sense to focus on certain words and not others; e.g., we should eliminate stop words.
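
As a rough illustration of eliminating stop words, the sketch below turns an email body into word counts. The stop-word list, the tokenization rule, and the function name word_features are assumptions made for this example; a real system would use a much longer list and might tokenize differently.

```python
import re
from collections import Counter

# Illustrative stop-word list only; not taken from the text.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "you", "your"}
)


def word_features(text: str, stop_words: frozenset = STOP_WORDS) -> Counter:
    """Lowercase the text, split it into word tokens, drop stop words,
    and count the occurrences of each remaining word."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)
```

For example, word_features("Claim your FREE prize: the prize is free!") counts "claim" once and "free" and "prize" twice each, while "your", "the", and "is" are dropped.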
