As Prof. Deming puts it: “In God we trust; all others must bring
data”.
The data that is used to train the model is called the “training set”. In
the spam filtering example, a training set can be extracted from past
experience and represented as a database of previous emails. Possible
attributes that can serve as indicators of spam activity are: the number
of recipients, the size of the message, the number of attachments, the
number of times the string “re” appears in the subject line, the country
from which the email was sent, etc.
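As a minimal sketch, such a training set could be stored as a list of
records in Python. The attribute names and values below are hypothetical,
chosen only to mirror the indicators listed above:

# A hypothetical training set for the spam filtering example.
# Each record holds the input attributes plus the target label.
training_set = [
    {"num_recipients": 53, "size_kb": 4,   "num_attachments": 0,
     "re_count": 0, "country": "US", "label": "spam"},
    {"num_recipients": 1,  "size_kb": 120, "num_attachments": 1,
     "re_count": 2, "country": "DE", "label": "ham"},
    {"num_recipients": 87, "size_kb": 2,   "num_attachments": 0,
     "re_count": 0, "country": "CN", "label": "spam"},
]

# The target attribute is the "label" field; all other fields are inputs.
inputs = [{k: v for k, v in row.items() if k != "label"}
          for row in training_set]
targets = [row["label"] for row in training_set]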
The training set is usually represented as a table. Each row represents
a single email instance that was delivered via the mail server. Each column
corresponds to an attribute that characterizes the email instance (such as
the number of recipients). In any supervised learning task, and particularly
in classification tasks, one column corresponds to the target attribute that
we try to predict. In our example, the target attribute indicates whether
the email is spam or ham (non-spam). All other columns hold the input
attributes that are used for making the predicted classification. There are
three main types of input attributes, as illustrated in the sketch that
follows the list:
(1) Numeric — such as the attribute “number of recipients”. Both the order
of values and the distance between them are meaningful.
(2) Ordinal — a categorical type whose values provide an order by which the
data can be sorted, but unlike the numeric type there is no notion of
distance. For example, the three classes of medal (Gold, Silver and
Bronze) suit an ordinal attribute.
(3) Nominal — in which the values are merely distinct names or labels with
no meaningful order by which one can sort the data. For example, the
gender attribute can be regarded as nominal.
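To make the distinction concrete, here is a minimal Python sketch (the
values are hypothetical): a numeric attribute supports distances, an
ordinal attribute supports only sorting via an explicit ranking, and a
nominal attribute supports neither:

# Numeric: both order and distance are meaningful.
num_recipients = 53
print(num_recipients - 50)     # a difference of 3 recipients makes sense

# Ordinal: sortable through an explicit ranking, but distances between
# Gold and Silver versus Silver and Bronze are undefined.
medal_rank = {"Bronze": 0, "Silver": 1, "Gold": 2}
medals = ["Silver", "Gold", "Bronze"]
print(sorted(medals, key=medal_rank.get))   # ['Bronze', 'Silver', 'Gold']

# Nominal: distinct labels with no meaningful order.
gender = "female"              # "female" < "male" carries no meaning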
Since in many machine learning algorithms the training set size and the
predictive performance are positively correlated, we usually prefer to use
the largest possible training set. In practice, however, we might want to
limit the training set due to resource constraints. A large training set
implies a long training time, so we might select a sample of our data that
fits our computational resources. Moreover, data collection, and
particularly labeling the instances, may come with a price tag in terms of
human effort. In our email filtering example, labeling the training emails
as either “spam” or “ham” is done manually by the users, and it might
therefore be too expensive to label all the emails.
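A minimal sketch of such subsampling, assuming a hypothetical budget of
10,000 labeled emails (the function name and budget are illustrative, not
part of any particular library):

import random

def sample_training_set(training_set, max_size, seed=42):
    """Return a uniform random sample when the data exceeds our budget."""
    if len(training_set) <= max_size:
        return list(training_set)
    rng = random.Random(seed)          # fixed seed for reproducibility
    return rng.sample(training_set, max_size)

# e.g. cap the training set at 10,000 emails (an arbitrary budget):
# subset = sample_training_set(all_emails, max_size=10_000)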