import numpy as np
from pyspark.mllib.regression import LinearRegressionWithSGD

def evaluate(train, test, iterations, step, regParam, regType, intercept):
    # Train a linear model with the given SGD settings and return its RMSLE
    # on the test set (squared_log_error is the helper defined earlier).
    model = LinearRegressionWithSGD.train(train, iterations, step,
                                          regParam=regParam, regType=regType,
                                          intercept=intercept)
    tp = test.map(lambda p: (p.label, model.predict(p.features)))
    rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
    return rmsle
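The squared_log_error function used in evaluate is assumed to be the per-record squared log error helper defined earlier; a minimal sketch of such a helper looks like this:

def squared_log_error(pred, actual):
    # Squared difference of log-transformed values; the +1 guards against log(0)
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2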
Tip
Note that in the following sections you might get slightly different results, due to the random initialization used by SGD; however, your results should be comparable.
Iterations
As we saw when evaluating our classification models, we generally expect that a model trained with SGD will achieve better performance as the number of iterations increases, although the improvement slows down once the number of iterations rises beyond a certain point. Note that here we will set the step size to 0.01 to better illustrate the impact at higher iteration counts:
params = [1, 5, 10, 20, 50, 100]
metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False)
           for param in params]
print params
print metrics
The output shows that the error metric indeed decreases as the number of iterations increases, and it does so at a decreasing rate, again as expected. What is interesting is that, eventually, the SGD optimization overshoots the optimal solution and the RMSLE starts to increase slightly:
[1, 5, 10, 20, 50, 100]
[2.3532904530306888, 1.6438528499254723, 1.4869656275309227,
 1.4149741941240344, 1.4159641262731959, 1.4539667094611679]
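It can be helpful to plot the RMSLE against the number of iterations to see this trend. The following is a minimal sketch that assumes matplotlib is available in the driver environment (the plotting library is an assumption, not part of the evaluation code above):

import matplotlib.pyplot as plt

plt.plot(params, metrics)
plt.xscale('log')  # the iteration counts span several orders of magnitude
plt.xlabel('iterations')
plt.ylabel('RMSLE')
plt.show()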