Cross-validation
So far, we have only briefly mentioned the idea of cross-validation and out-of-
sample testing. Cross-validation is a critical part of real-world machine learning and is
central to many model selection and parameter tuning pipelines.
The general idea behind cross-validation is that we want to know how our model will per-
form on unseen data. Evaluating this on real, live data (for example, in a production sys-
tem) is risky, because we don't really know whether the trained model is the best in the
sense of being able to make accurate predictions on new data. As we saw previously with
regard to regularization, our model might have over-fit the training data and be poor at
making predictions on data it has not been trained on.
Cross-validation provides a mechanism where we use part of our available dataset to train
our model and another part to evaluate the performance of this model. As the model is
tested on data that it has not seen during the training phase, its performance, when evalu-
ated on this part of the dataset, gives us an estimate of how well our model generalizes
to new data points.
Here, we will implement a simple cross-validation evaluation approach using a train-test
split. We will divide our dataset into two non-overlapping parts. The first dataset is used to
train our model and is called the training set. The second dataset, called the test set or hold-
out set, is used to evaluate the performance of our model using our chosen evaluation
measure. Common splits used in practice include 50/50, 60/40, and 80/20 splits, but you
can use any split as long as the training set is not too small for the model to learn (gener-
ally, at least 50 percent is a practical minimum).
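As an illustration, the following is a minimal sketch of such a split in Python using scikit-learn's train_test_split; the feature matrix X, the labels y, and the 80/20 ratio are illustrative assumptions rather than part of the original example.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 examples with 10 features and binary labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold out 20 percent of the data; the model never sees X_test or y_test during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (800, 10) (200, 10)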
In many cases, three sets are created: a training set, an evaluation set (which is used like the
above test set to tune the model parameters such as lambda and step size), and a test set
(which is never used to train a model or tune any parameters, but is only used to generate
an estimated true performance on completely unseen data).
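One way to obtain such a three-way split is simply to apply the same random splitting twice, as in the sketch below; the 60/20/20 proportions and the hypothetical X and y arrays are assumptions for illustration only.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)             # hypothetical features
y = np.random.randint(0, 2, size=1000)   # hypothetical labels

# First split off a 20 percent test set that is only used for the final estimate.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then carve a validation set out of the remaining data for tuning parameters
# such as the regularization strength and step size.
# 0.25 of the remaining 80 percent is 20 percent of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200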
Note
Here, we will explore a simple train-test split approach. There are many cross-validation
techniques that are more exhaustive and complex.
One popular example is K-fold cross-validation, where the dataset is split into K non-over-
lapping folds. The model is trained on K-1 folds of data and tested on the remaining, held-
out fold; this is repeated K times so that each fold serves once as the test set, and the K
performance estimates are averaged.
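For concreteness, here is a small sketch of 5-fold cross-validation in Python; the use of scikit-learn's KFold, a logistic regression model, and accuracy as the evaluation measure are illustrative assumptions, not the pipeline described in the text.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(500, 10)              # hypothetical features
y = np.random.randint(0, 2, size=500)    # hypothetical labels

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on 4 folds and evaluate on the held-out fold the model has not seen.
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Average the per-fold scores to get the cross-validated performance estimate.
print("mean accuracy over 5 folds:", np.mean(scores))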