So, we can see that the number of iterations has only a minor impact on the results once a
certain number of iterations has been completed.
Step size
In SGD, the step size parameter controls how far in the direction of steepest descent (that
is, along the negative gradient) the algorithm steps when updating the model weight vector
after each training example. A larger step size might speed up convergence, but a step size
that is too large can cause convergence problems, as good solutions are overshot.
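In symbols, this is the standard per-example SGD update, where α is the step size and L is
the per-example loss (note that MLlib's built-in updaters additionally scale the step size
by a factor of 1/√t across iterations t):

w(t+1) = w(t) − α ∇L(w(t); x_i, y_i)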
We can see the impact of changing the step size here:
val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  // hold regularization at 0.0 and vary only the step size
  val model = trainWithParams(scaledDataCats, 0.0, numIterations,
    new SimpleUpdater, param)
  createMetrics(s"$param step size", scaledDataCats, model)
}
stepResults.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
This will give us the following results, which show that increasing the step size too much
can begin to negatively impact performance.
0.001 step size, AUC = 64.95%
0.01 step size, AUC = 65.00%
0.1 step size, AUC = 65.52%
1.0 step size, AUC = 66.55%
10.0 step size, AUC = 61.92%
Regularization
We briefly touched on the Updater class in the preceding logistic regression code. In
MLlib, regularization is implemented by an Updater class. Regularization can help avoid
overfitting a model to the training data by effectively penalizing model complexity. This
is done by adding a term to the loss function that grows with the size of the weights in
the model weight vector, so that more complex models incur a higher loss.
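Concretely, the regularized objective has the standard form

f(w) = (1/n) Σ_i L(w; x_i, y_i) + λ R(w)

where λ is the regularization parameter (the regParam argument to the trainWithParams
helper used earlier) and R(w) is the penalty term, for example R(w) = ½‖w‖₂² for L2
regularization or R(w) = ‖w‖₁ for L1 regularization.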
Regularization is almost always required in real use cases, but is of particular importance
when the feature dimension is very high (that is, the effective number of variable weights
that can be learned is high) relative to the number of training examples.
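As a sketch of how this looks in practice, we can reuse the same trainWithParams and
createMetrics helpers as above, this time holding the step size fixed and varying the
regularization parameter with MLlib's SquaredL2Updater (the parameter values here are
illustrative, not prescriptive):

import org.apache.spark.mllib.optimization.SquaredL2Updater

val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  // vary the L2 regularization parameter, holding the step size fixed at 1.0
  val model = trainWithParams(scaledDataCats, param, numIterations,
    new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter",
    scaledDataCats, model)
}
regResults.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}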