print(params)
print(metrics)

The output of the preceding code:

[0.01, 0.025, 0.05, 0.1, 0.5]
[1.4869656275309227, 1.4189071944747715, 1.5027293911925559, 1.5384660954019973, nan]
Now, we can see why we avoided using the default step size when we originally trained the linear model. The default is set to 1.0, which, in this case, results in a nan output for the RMSLE metric. This typically means that the SGD model has converged to a very poor local minimum in the error function that it is optimizing. This can happen when the step size is relatively large, because it is easier for the optimization algorithm to overshoot good solutions.
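This overshooting effect can be seen even without Spark. The following is a hypothetical, minimal sketch (not the book's MLlib code) using plain gradient descent on a tiny one-dimensional least-squares problem: a small step size converges to the true weight, while a step size of 1.0 makes the iterates blow up, analogous to the nan RMSLE above.

```python
def gd_weight(step_size, num_iterations, xs, ys):
    """Full-batch gradient descent for the model y ~ w * x with squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(num_iterations):
        # Gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step_size * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # the true weight is 2.0

small = gd_weight(0.05, 100, xs, ys)  # converges to roughly 2.0
large = gd_weight(1.0, 100, xs, ys)   # step too large: the weight explodes
print(small)
print(large)
```

With `step_size=1.0`, each update multiplies the error in the weight by a factor larger than one in magnitude, so after 100 iterations the weight is astronomically large; downstream metrics computed from such a model naturally degenerate to nan or infinity.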
We can also see that for low step sizes and a relatively low number of iterations (we used 10 here), the model performance is slightly poorer. However, in the preceding Iterations section, we saw that for the lower step-size settings, a higher number of iterations will generally converge to a better solution.
Generally speaking, setting the step size and the number of iterations involves a trade-off. A lower step size means that convergence is slower but somewhat more assured. However, it requires a higher number of iterations, which is more costly in terms of computation and time, particularly at very large scale.
Tip
Selecting the best parameter settings can be an intensive process that involves training a model on many combinations of parameter settings and selecting the best outcome. Each instance of model training involves a number of iterations, so this process can be very expensive and time consuming when performed on very large datasets.
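The combination search described in the tip can be sketched in a few lines. This is a hypothetical illustration (again plain Python rather than the book's Spark code): train one small model per (step size, iterations) pair and keep the pair with the lowest RMSE on the training data.

```python
import math

def train(step_size, num_iterations, xs, ys):
    """Gradient descent for y ~ w * x, as a stand-in for model training."""
    w = 0.0
    n = len(xs)
    for _ in range(num_iterations):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step_size * grad
    return w

def rmse(w, xs, ys):
    return math.sqrt(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Grid search: one full training run per parameter combination,
# which is why this gets expensive on large datasets.
best = min(
    ((s, n, rmse(train(s, n, xs, ys), xs, ys))
     for s in [0.01, 0.025, 0.05, 0.1]
     for n in [10, 50, 100]),
    key=lambda t: t[2],
)
print(best)  # (step_size, num_iterations, rmse) of the best combination
```

In practice, the evaluation would be done on a held-out set rather than the training data, and each grid cell would be a full (and costly) distributed training run.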
The output is plotted here, again using a log scale for the step-size axis: