As Prof. Deming puts it: “In God we trust; all others must bring
data”.
The data that is used to train the model is called the “training set”. In
the spam filtering example, a training set can be extracted from past
experience and represented as a database of previous emails. Possible
attributes that can serve as indicators of spam activity are: the number
of recipients, the size of the message, the number of attachments, the
number of times the string “re” appears in the subject line, the country
from which the email was sent, etc.
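As a minimal sketch, such a training set could be stored as a list of
records in Python. The attribute names and values below are hypothetical,
chosen only to mirror the indicators listed above:

# A hypothetical training set for the spam filtering example.
# Each record holds the input attributes plus the target label.
training_set = [
    {"num_recipients": 53, "size_kb": 4,   "num_attachments": 0,
     "re_count": 0, "country": "US", "label": "spam"},
    {"num_recipients": 1,  "size_kb": 120, "num_attachments": 1,
     "re_count": 2, "country": "DE", "label": "ham"},
    {"num_recipients": 87, "size_kb": 2,   "num_attachments": 0,
     "re_count": 0, "country": "CN", "label": "spam"},
]

# The target attribute is the "label" field; all other fields are inputs.
inputs = [{k: v for k, v in row.items() if k != "label"}
          for row in training_set]
targets = [row["label"] for row in training_set]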
The training set is usually represented as a table. Each row represents
a single email instance that was delivered via the mail server. Each column
corresponds to an attribute that characterizes the email instance (such as
the number of recipients). In any supervised learning task, and particularly
in classification tasks, one column corresponds to the target attribute that
we try to predict. In our example, the target attribute indicates whether
the email is spam or ham (non-spam). All other columns hold the input
attributes that are used for making the predicted classification. There are
three main types of input attributes, as illustrated in the sketch that
follows the list:
(1) Numeric — such as the attribute “number of recipients”. Both the order
of values and the distance between them are meaningful.
(2) Ordinal — a categorical type whose values provide an order by which the
data can be sorted, but unlike the numeric type there is no notion of
distance. For example, the three classes of medal (Gold, Silver and
Bronze) suit an ordinal attribute.
(3) Nominal — in which the values are merely distinct names or labels with
no meaningful order by which one can sort the data. For example, the
gender attribute can be regarded as nominal.
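To make the distinction concrete, here is a minimal Python sketch (the
values are hypothetical): a numeric attribute supports distances, an
ordinal attribute supports only sorting via an explicit ranking, and a
nominal attribute supports neither:

# Numeric: both order and distance are meaningful.
num_recipients = 53
print(num_recipients - 50)     # a difference of 3 recipients makes sense

# Ordinal: sortable through an explicit ranking, but distances between
# Gold and Silver versus Silver and Bronze are undefined.
medal_rank = {"Bronze": 0, "Silver": 1, "Gold": 2}
medals = ["Silver", "Gold", "Bronze"]
print(sorted(medals, key=medal_rank.get))   # ['Bronze', 'Silver', 'Gold']

# Nominal: distinct labels with no meaningful order.
gender = "female"              # "female" < "male" carries no meaning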
Since in many machine learning algorithms the training set size and the
predictive performance are positively correlated, we usually prefer to use
the largest possible training set. In practice, however, we might want to
limit the training set due to resource constraints. A large training set
implies a long training time, so we might select a sample of our data that
fits our computational resources. Moreover, data collection, and
particularly labeling the instances, may come with a price tag in terms of
human effort. In our email filtering example, labeling the training emails
as either “spam” or “ham” is done manually by the users, and it might
therefore be too expensive to label all the emails.
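A minimal sketch of such subsampling, assuming a hypothetical budget of
10,000 labeled emails (the function name and budget are illustrative, not
part of any particular library):

import random

def sample_training_set(training_set, max_size, seed=42):
    """Return a uniform random sample when the data exceeds our budget."""
    if len(training_set) <= max_size:
        return list(training_set)
    rng = random.Random(seed)          # fixed seed for reproducibility
    return rng.sample(training_set, max_size)

# e.g. cap the training set at 10,000 emails (an arbitrary budget):
# subset = sample_training_set(all_emails, max_size=10_000)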