import numpy as np
from pyspark.mllib.regression import LinearRegressionWithSGD

def evaluate(train, test, iterations, step, regParam, regType, intercept):
    # Train a linear model with the given SGD settings and return its RMSLE
    # on the test set (squared_log_error is the helper defined earlier).
    model = LinearRegressionWithSGD.train(train, iterations, step,
                                          regParam=regParam, regType=regType,
                                          intercept=intercept)
    tp = test.map(lambda p: (p.label, model.predict(p.features)))
    rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
    return rmsle
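The squared_log_error function used in evaluate is assumed to be the per-record squared log error helper defined earlier; a minimal sketch of such a helper looks like this:

def squared_log_error(pred, actual):
    # Squared difference of log-transformed values; the +1 guards against log(0)
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2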
Tip
Note that in the following sections you might get slightly different results, due to the random initialization used by SGD; however, your results should be comparable.
Iterations
As we saw when evaluating our classification models, we generally expect that a model trained with SGD will achieve better performance as the number of iterations increases, although the improvement slows down once the number of iterations rises beyond a certain point. Note that here we will set the step size to 0.01 to better illustrate the impact at higher iteration counts:
params = [1, 5, 10, 20, 50, 100]
metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False)
           for param in params]
print params
print metrics
The output shows that the error metric indeed decreases as the number of iterations increases, and it does so at a decreasing rate, again as expected. What is interesting is that, eventually, the SGD optimization overshoots the optimal solution and the RMSLE starts to increase slightly:
[1, 5, 10, 20, 50, 100]
[2.3532904530306888, 1.6438528499254723, 1.4869656275309227,
 1.4149741941240344, 1.4159641262731959, 1.4539667094611679]
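It can be helpful to plot the RMSLE against the number of iterations to see this trend. The following is a minimal sketch that assumes matplotlib is available in the driver environment (the plotting library is an assumption, not part of the evaluation code above):

import matplotlib.pyplot as plt

plt.plot(params, metrics)
plt.xscale('log')  # the iteration counts span several orders of magnitude
plt.xlabel('iterations')
plt.ylabel('RMSLE')
plt.show()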