out fold. This is repeated K times, and the results are averaged to give the cross-validation
score. The train-test split is effectively like two-fold cross-validation.
Other approaches include leave-one-out cross-validation and random sampling. See the
article at http://en.wikipedia.org/wiki/Cross-validation_(statistics) for further details.
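The K-fold idea described above can be sketched in a few lines of plain Scala. The `kFoldIndices` helper below is purely illustrative (it is not a Spark API; Spark's own `MLUtils.kFold` provides comparable functionality on RDDs): it partitions the record indices into K folds and, for each fold, treats that fold as the held-out test set and the remaining records as the training set.

```scala
object KFoldSketch {
  // Illustrative helper: split indices 0 until n into k roughly equal folds.
  // For each fold, return (trainIndices, testIndices), where the fold itself
  // is held out for testing and all other indices form the training set.
  def kFoldIndices(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val foldSize = math.ceil(n.toDouble / k).toInt
    val folds = (0 until n).grouped(foldSize).map(_.toSeq).toSeq
    folds.map { testIdx =>
      val trainIdx = (0 until n).filterNot(testIdx.contains)
      (trainIdx, testIdx)
    }
  }

  def main(args: Array[String]): Unit = {
    // 6 data points, 3 folds: each point appears in exactly one test fold
    kFoldIndices(6, 3).foreach { case (train, test) =>
      println(s"train=$train test=$test")
    }
  }
}
```

In practice we would train a model on each `trainIdx` subset, evaluate it on the corresponding `testIdx` subset, and average the K evaluation scores to obtain the cross-validation score.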
First, we will split our dataset into a 60 percent training set and a 40 percent test set (we
will use a constant random seed of 123 here to ensure that we get the same results for ease
of illustration):
val trainTestSplit = scaledDataCats.randomSplit(Array(0.6, 0.4), 123)
val train = trainTestSplit(0)
val test = trainTestSplit(1)
Next, we will compute the evaluation metric of interest (again, we will use AUC) for a
range of regularization parameter settings. Note that here we will use a finer-grained step
size between the evaluated regularization parameters to better illustrate the differences in
AUC, which are very small in this case:
val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map { param =>
  val model = trainWithParams(train, param, numIterations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", test, model)
}
regResultsTest.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.6f%%")
}
This will train a model on the training set for each regularization setting and evaluate it on the test set, producing the following output:
0.0 L2 regularization parameter, AUC = 66.480874%
0.001 L2 regularization parameter, AUC = 66.480874%
0.0025 L2 regularization parameter, AUC = 66.515027%
0.005 L2 regularization parameter, AUC = 66.515027%
0.01 L2 regularization parameter, AUC = 66.549180%
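Since `createMetrics` returns a (label, AUC) pair for each setting, selecting the best-performing regularization parameter is a one-liner over the `regResultsTest` sequence from the code above (a minimal sketch, assuming those pairs are in scope):

```scala
// Illustrative standalone sketch: pick the setting with the highest AUC
// from (label, auc) pairs like those produced by createMetrics above.
object BestParamSketch {
  def main(args: Array[String]): Unit = {
    val regResultsTest = Seq(
      ("0.0 L2 regularization parameter", 0.66480874),
      ("0.001 L2 regularization parameter", 0.66480874),
      ("0.0025 L2 regularization parameter", 0.66515027),
      ("0.005 L2 regularization parameter", 0.66515027),
      ("0.01 L2 regularization parameter", 0.66549180)
    )
    // maxBy on the AUC component selects the best setting
    val (bestLabel, bestAuc) = regResultsTest.maxBy { case (_, auc) => auc }
    println(f"Best: $bestLabel, AUC = ${bestAuc * 100}%2.6f%%")
  }
}
```

For the results shown here, this would select the 0.01 setting, although the differences between the settings are very small.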