12.1.4 Machine-Learning Architecture
Machine-learning algorithms can be classified not only by their general algorithmic approach, as we discussed in Section 12.1.3, but also by their underlying architecture: the way data is handled and the way it is used to build the model.
Training and Testing
One general issue regarding the handling of data is that there is good reason to withhold some of the available data from the training set. The withheld data is called the test set. The problem addressed is that many machine-learning algorithms tend to overfit the data; they pick up on artifacts that occur in the training set but that are atypical of the larger population of possible data. By applying the classifier to the test data and seeing how well it performs there, we can tell whether the classifier is overfitting the data. If so, we can restrict the machine-learning algorithm in some way. For instance, if we are constructing a decision tree, we can limit the number of levels of the tree.
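For illustration, here is a minimal sketch of such a restriction, assuming Python with scikit-learn (neither of which is prescribed by the text):

    from sklearn.tree import DecisionTreeClassifier

    # An unrestricted tree (the default) can keep splitting until it fits the
    # training data perfectly, which invites overfitting.
    unrestricted_tree = DecisionTreeClassifier()

    # Capping max_depth limits the number of levels of the tree, giving a simpler
    # model that is less likely to pick up on training-set artifacts.
    restricted_tree = DecisionTreeClassifier(max_depth=4)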
Figure 12.3 illustrates the train-and-test architecture. We assume all the data is suitable
for training (i.e., the class information is attached to the data), but we separate out a small
fraction of the available data as the test set. We use the remaining data to build a suitable
model or classifier. Then we feed the test data to this model. Since we know the class of each element of the test data, we can tell how well the model performs on it. If the
error rate on the test data is not much worse than the error rate of the model on the training
data itself, then we expect there is little, if any, overfitting, and the model can be used. On
the other hand, if the classifier performs much worse on the test data than on the training
data, we expect there is overfitting and need to rethink the way we construct the classifier.
Figure 12.3 The training set helps build the model, and the test set validates it
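The following sketch follows this train-and-test architecture in Python with scikit-learn; the dataset (iris) and the choice of a decision-tree classifier are illustrative assumptions, not part of the text.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Labeled data: the class information (y) is attached to every element of X.
    X, y = load_iris(return_X_y=True)

    # Separate out a small fraction (here 20%) of the available data as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Use the remaining data to build the model.
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Compare the error rate on the training data with the error rate on the test data.
    train_error = 1 - model.score(X_train, y_train)
    test_error = 1 - model.score(X_test, y_test)
    print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")

    # A test error much worse than the training error suggests overfitting, and the
    # classifier should be restricted (e.g., by limiting the depth of the tree).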
There is nothing special about the selection of the test data. In fact, we can repeat the train-then-test process several times using the same data, if we divide the data into k equal-sized chunks. In turn, we let each chunk be the test data, and use the remaining k − 1 chunks as the training data. This training architecture is called cross-validation.
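A minimal cross-validation sketch, again assuming scikit-learn and reusing the X and y arrays from the previous example, with k = 5:

    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    k = 5
    errors = []
    # Divide the data into k equal-sized chunks; each chunk serves once as the
    # test set, and the remaining k - 1 chunks form the training set.
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier(max_depth=4).fit(X[train_idx], y[train_idx])
        errors.append(1 - model.score(X[test_idx], y[test_idx]))

    # Average test error over the k train-then-test rounds.
    print(sum(errors) / k)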
Generalization