12.1.4 Machine-Learning Architecture
Machine-learning algorithms can be classified not only by their general algorithmic approach, as we discussed in Section 12.1.3, but also by their underlying architecture: the way data is handled and the way it is used to build the model.
Training and Testing
One general issue regarding the handling of data is that there is good reason to withhold some of the available data from the training set. The withheld data is called the test set. The problem addressed is that many machine-learning algorithms tend to overfit the data; they pick up on artifacts that occur in the training set but that are atypical of the larger population of possible data. By applying the classifier to the test data and seeing how well it performs there, we can tell whether the classifier is overfitting the data. If so, we can restrict the machine-learning algorithm in some way. For instance, if we are constructing a decision tree, we can limit the number of levels of the tree.
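For illustration, here is a minimal sketch of such a restriction, assuming Python with scikit-learn (neither of which is prescribed by the text):

    from sklearn.tree import DecisionTreeClassifier

    # An unrestricted tree (the default) can keep splitting until it fits the
    # training data perfectly, which invites overfitting.
    unrestricted_tree = DecisionTreeClassifier()

    # Capping max_depth limits the number of levels of the tree, giving a simpler
    # model that is less likely to pick up on training-set artifacts.
    restricted_tree = DecisionTreeClassifier(max_depth=4)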
Figure 12.3 illustrates the train-and-test architecture. We assume all the data is suitable
for training (i.e., the class information is attached to the data), but we separate out a small
fraction of the available data as the test set. We use the remaining data to build a suitable
model or classifier. Then we feed the test data to this model. Since we know the class of each element of the test data, we can tell how well the model performs on it. If the
error rate on the test data is not much worse than the error rate of the model on the training
data itself, then we expect there is little, if any, overfitting, and the model can be used. On
the other hand, if the classifier performs much worse on the test data than on the training
data, we expect there is overfitting and need to rethink the way we construct the classifier.
Figure 12.3 The training set helps build the model, and the test set validates it
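The following sketch follows this train-and-test architecture in Python with scikit-learn; the dataset (iris) and the choice of a decision-tree classifier are illustrative assumptions, not part of the text.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Labeled data: the class information (y) is attached to every element of X.
    X, y = load_iris(return_X_y=True)

    # Separate out a small fraction (here 20%) of the available data as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Use the remaining data to build the model.
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Compare the error rate on the training data with the error rate on the test data.
    train_error = 1 - model.score(X_train, y_train)
    test_error = 1 - model.score(X_test, y_test)
    print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")

    # A test error much worse than the training error suggests overfitting, and the
    # classifier should be restricted (e.g., by limiting the depth of the tree).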
There is nothing special about the selection of the test data. In fact, we can repeat the train-then-test process several times using the same data, if we divide the data into k equal-sized chunks. In turn, we let each chunk be the test data, and use the remaining k − 1 chunks as the training data. This training architecture is called cross-validation.
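A minimal cross-validation sketch, again assuming scikit-learn and reusing the X and y arrays from the previous example, with k = 5:

    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    k = 5
    errors = []
    # Divide the data into k equal-sized chunks; each chunk serves once as the
    # test set, and the remaining k - 1 chunks form the training set.
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier(max_depth=4).fit(X[train_idx], y[train_idx])
        errors.append(1 - model.score(X[test_idx], y[test_idx]))

    # Average test error over the k train-then-test rounds.
    print(sum(errors) / k)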
Generalization