Training a regression model on the bike sharing dataset
We're ready to use the features we have extracted to train our models on the bike sharing
data. First, we'll train the linear regression model and take a look at the first few predictions
that the model makes on the data:
from pyspark.mllib.regression import LinearRegressionWithSGD

# Train a linear model using SGD with 10 iterations and a step size of 0.1
linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
# Pair each true target with the model's prediction for the same data point
true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
print "Linear Model predictions: " + str(true_vs_predicted.take(5))
Note that we have not used the default settings for iterations and step here. We reduced the number of iterations so that the model does not take too long to train, and you will see why the step size has been changed from the default a little later. You will see the following output:
Linear Model predictions: [(16.0, 119.30920003093595),
(40.0, 116.95463511937379), (32.0, 116.57294610647752),
(13.0, 116.43535423855654), (1.0, 116.221247828503)]
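If you want to preview the effect of the step size for yourself, a minimal sketch along the following lines trains a second model with the library's default step of 1.0 for comparison (the names default_step_model and true_vs_predicted_default are our own, not part of this example):

# Hypothetical comparison: same settings as above, but with the default step of 1.0
default_step_model = LinearRegressionWithSGD.train(data, iterations=10, step=1.0, intercept=False)
true_vs_predicted_default = data.map(lambda p: (p.label, default_step_model.predict(p.features)))
print "Default step predictions: " + str(true_vs_predicted_default.take(5))

With a step this large, the SGD updates tend to overshoot on this data, which is why we lowered it; we will examine this in more detail later.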
Next, we will train the decision tree model simply using the default arguments to the trainRegressor method (which equates to using a tree depth of 5). Note that we need to pass in the other form of the dataset, data_dt, that we created from the raw feature values (as opposed to the binary-encoded features that we used for the preceding linear model).
We also need to pass in an argument for categoricalFeaturesInfo. This is a dictionary that maps the categorical feature index to the number of categories for the feature. If a feature is not in this mapping, it will be treated as continuous. For our purposes, we will leave this as is, passing in an empty mapping:
from pyspark.mllib.tree import DecisionTree

# Train a decision tree regressor with default arguments and an empty mapping
dt_model = DecisionTree.trainRegressor(data_dt, {})
# Zip the true labels with the tree's predictions on the feature vectors
preds = dt_model.predict(data_dt.map(lambda p: p.features))
actual = data.map(lambda p: p.label)
true_vs_predicted_dt = actual.zip(preds)
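For illustration, if data_dt did contain categorical features, the mapping would look something like the following sketch; the feature indices and category counts here are hypothetical and not the actual bike sharing schema:

# Hypothetical mapping: feature 0 has 4 categories, feature 1 has 12
# (illustrative indices and counts only, not the real dataset mapping)
cat_features_info = {0: 4, 1: 12}
dt_model_cat = DecisionTree.trainRegressor(data_dt, cat_features_info)

Any feature listed in the dictionary is then split on its category values, while all remaining features are treated as continuous, exactly as described above.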