Now, let's compare this to the results of training and testing on the training set (this is
what we were doing previously by training and testing on all data). Again, we will omit
the code as it is very similar (but it is available in the code bundle):
0.0 L2 regularization parameter, AUC = 66.260311%
0.001 L2 regularization parameter, AUC = 66.260311%
0.0025 L2 regularization parameter, AUC = 66.260311%
0.005 L2 regularization parameter, AUC = 66.238294%
0.01 L2 regularization parameter, AUC = 66.238294%
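For reference, the omitted code is essentially the same parameter loop as before, applied to the training data itself. A minimal sketch of what it might look like is shown here, assuming a scaled RDD[LabeledPoint] called dataTrainScaled and a trainWithL2 helper (both names are illustrative, not taken from the original code):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Illustrative helper: train logistic regression with a given L2
// regularization parameter, holding the other settings fixed so that
// only the effect of regularization is measured.
def trainWithL2(input: RDD[LabeledPoint], regParam: Double) = {
  val lr = new LogisticRegressionWithSGD
  lr.optimizer
    .setNumIterations(10)
    .setUpdater(new SquaredL2Updater)
    .setRegParam(regParam)
  lr.run(input)
}

// Train and evaluate on the *same* scaled training data.
Seq(0.0, 0.001, 0.0025, 0.005, 0.01).foreach { param =>
  val model = trainWithL2(dataTrainScaled, param)
  val scoreAndLabels = dataTrainScaled.map { point =>
    (model.predict(point.features), point.label)
  }
  val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
  println(f"$param L2 regularization parameter, AUC = ${auc * 100}%2.6f%%")
}

Holding the number of iterations and step size fixed in the helper isolates the effect of the regularization parameter on the AUC.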
So, we can see that when we train and evaluate our model on the same dataset, we generally achieve the highest performance when regularization is lower. This is because our model has seen all the data points, and with low levels of regularization, it can overfit the dataset and achieve higher performance.
In contrast, when we train on one dataset and test on another, we see that generally a
slightly higher level of regularization results in better test set performance.
In cross-validation, we would typically find the parameter settings (including regularization as well as the various other parameters, such as step size and so on) that result in the best test set performance. We would then use these parameter settings to retrain the model on all of our data in order to use it to make predictions on new data.
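A sketch of this workflow, reusing the imports and the illustrative trainWithL2 helper from the previous snippet (the split proportions, seed, and candidate grid below are assumptions for illustration, with the full scaled dataset assumed to be in dataScaled):

import org.apache.spark.mllib.classification.ClassificationModel

// Compute the AUC for a model on a given dataset.
def aucFor(model: ClassificationModel, data: RDD[LabeledPoint]): Double = {
  val scoreAndLabels = data.map(p => (model.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
}

// Split the scaled data into a training set and a test set.
val Array(train, test) = dataScaled.randomSplit(Array(0.6, 0.4), seed = 42)
train.cache()
test.cache()

// Pick the regularization parameter with the best test set AUC ...
val (bestParam, bestAuc) = Seq(0.0, 0.001, 0.0025, 0.005, 0.01)
  .map(param => (param, aucFor(trainWithL2(train, param), test)))
  .maxBy(_._2)
println(s"Best L2 regularization parameter: $bestParam (test AUC = $bestAuc)")

// ... and retrain on all of the data with that setting before using
// the final model to make predictions on new data.
val finalModel = trainWithL2(dataScaled, bestParam)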
Tip
Recall from Chapter 4, Building a Recommendation Engine with Spark, that we did not cover cross-validation. You can apply the same techniques we used earlier to split the ratings dataset from that chapter into a training and test dataset. You can then try out different parameter settings on the training set while evaluating the MSE and MAP performance metrics on the test set in a manner similar to what we did earlier. Give it a try!
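As a starting point, a minimal sketch of the split and the MSE part of that exercise might look like the following (assuming an RDD[Rating] called ratings from that chapter; the split proportions, seed, and lambda grid are illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Split the ratings data into training and test sets.
val Array(trainRatings, testRatings) =
  ratings.randomSplit(Array(0.8, 0.2), seed = 42)

// Evaluate a few regularization (lambda) settings on the test set using MSE.
Seq(0.01, 0.1, 1.0).foreach { lambda =>
  val model = ALS.train(trainRatings, 50, 10, lambda) // rank = 50, 10 iterations
  val predictions = model
    .predict(testRatings.map(r => (r.user, r.product)))
    .map(r => ((r.user, r.product), r.rating))
  val mse = testRatings
    .map(r => ((r.user, r.product), r.rating))
    .join(predictions)
    .map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }
    .mean()
  println(s"lambda = $lambda, test MSE = $mse")
}

The same train/test split can then be reused for the MAP evaluation.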