But we should also ask whether there is other information available that would help us
make a better decision about spam. For example, spam is often generated by particular
hosts, either those belonging to the spammers or hosts that have been co-opted into a “botnet”
for the purpose of generating spam. Thus, including the originating host or originating
email address in the feature vector describing an email might enable us to design a better
classifier and lower the error rate.
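As a minimal sketch of this idea, the Python function below builds a sparse feature vector that contains both word features and a feature for the originating host. The function name email_features, the "word=" and "host=" key prefixes, and the sample address are assumptions made for illustration only; the host is taken to be the part of the sender address after the last "@".

```python
def email_features(sender, body, vocabulary):
    """Build a sparse feature vector (a dict) for one email.

    Hypothetical layout: one binary feature per vocabulary word that
    occurs in the body, plus one binary feature naming the sender's host.
    """
    # Assumption: the originating host is the domain part of the address.
    host = sender.rsplit("@", 1)[-1].lower()
    words = set(body.lower().split())
    features = {f"word={w}": 1 for w in vocabulary if w in words}
    features[f"host={host}"] = 1
    return features


if __name__ == "__main__":
    # Made-up example email.
    print(email_features("promo@bulk-mailer.example",
                         "Win a free prize", ["free", "meeting"]))
```

A classifier trained on such vectors can then learn a weight for each host, just as it does for each word.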
Creating a Training Set
It is reasonable to ask where the label information that turns data into a training set comes
from. The obvious method is to create the labels by hand, having an expert look at each
feature vector and classify it properly. More recently, crowdsourcing services such as
Amazon Mechanical Turk have been used to label data. Since the “Turkers” are not
necessarily reliable, it is wise to use a system that asks the same question of several
different people until a clear majority favors one label.
One often can find data on the Web that is implicitly labeled. For example, the Open
Directory (DMOZ) has millions of pages labeled by topic. That data, used as a training set,
can enable one to classify other pages or documents according to their topic, based on the
frequency of word occurrence. Another approach to classifying by topic is to look at the
Wikipedia page for a topic and see what pages it links to. Those pages can safely be
assumed to be relevant to the given topic.
In some applications we can use the stars that people use to rate products or services on
sites like Amazon or Yelp. For example, we might want to estimate the number of stars
that would be assigned to reviews or tweets about a product, even if those reviews do not
have star ratings. If we use star-labeled reviews as a training set, we can deduce the words
that are most commonly associated with positive and negative reviews, a task called
sentiment analysis. The presence of these words in other reviews can then tell us the
sentiment of those reviews.
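A rough Python sketch of this approach follows. It is a deliberately simple scoring scheme, not a method prescribed by the text: the cutoffs (4 or more stars counts as positive, 2 or fewer as negative), the score formula, and the toy reviews are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def word_scores(labeled_reviews, positive_min=4, negative_max=2):
    """Score each word in [-1, 1] from star-labeled reviews.

    A word's score is (p - n) / (p + n), where p and n are the numbers of
    positive and negative reviews containing it. Reviews with a middling
    star rating are ignored.
    """
    pos, neg = Counter(), Counter()
    for text, stars in labeled_reviews:
        words = set(tokenize(text))  # count each word once per review
        if stars >= positive_min:
            pos.update(words)
        elif stars <= negative_max:
            neg.update(words)
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w])
            for w in set(pos) | set(neg)}


def sentiment(text, scores):
    """Average the scores of the known words; 0.0 if none are known."""
    known = [scores[w] for w in tokenize(text) if w in scores]
    return sum(known) / len(known) if known else 0.0


if __name__ == "__main__":
    # Made-up training reviews with star ratings.
    training = [("great product, works great", 5),
                ("terrible, broke quickly", 1),
                ("great value", 4),
                ("terrible product", 2)]
    scores = word_scores(training)
    print(sentiment("a great phone", scores))
    print(sentiment("terrible battery", scores))
```

A positive average suggests a favorable review and a negative one an unfavorable review; mapping the score to a predicted number of stars would need a further step, such as a regression fitted on the labeled reviews.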
12.1.5 Exercises for Section 12.1
EXERCISE 12.1.1 Redo Example 12.2 for the following different forms of f(x).
(a) Require f(x) = ax; i.e., a straight line through the origin. Is the line that we discussed in the example optimal?
(b) Require f(x) to be a quadratic; i.e., f(x) = ax^2 + bx + c.