Training a regression model on the bike sharing dataset
We're ready to use the features we have extracted to train our models on the bike sharing
data. First, we'll train the linear regression model and take a look at the first few predictions
that the model makes on the data:
from pyspark.mllib.regression import LinearRegressionWithSGD

# Train a linear model using SGD with 10 iterations and a step size of 0.1
linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
# Pair each true target with the model's prediction for the same data point
true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
print "Linear Model predictions: " + str(true_vs_predicted.take(5))
Note that we have not used the default settings for iterations and step here. We reduced the number of iterations so that the model does not take too long to train, and you will see why the step size has been changed from the default a little later. You will see the following output:
Linear Model predictions: [(16.0, 119.30920003093595),
(40.0, 116.95463511937379), (32.0, 116.57294610647752),
(13.0, 116.43535423855654), (1.0, 116.221247828503)]
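If you want to preview the effect of the step size for yourself, a minimal sketch along the following lines trains a second model with the library's default step of 1.0 for comparison (the names default_step_model and true_vs_predicted_default are our own, not part of this example):

# Hypothetical comparison: same settings as above, but with the default step of 1.0
default_step_model = LinearRegressionWithSGD.train(data, iterations=10, step=1.0, intercept=False)
true_vs_predicted_default = data.map(lambda p: (p.label, default_step_model.predict(p.features)))
print "Default step predictions: " + str(true_vs_predicted_default.take(5))

With a step this large, the SGD updates tend to overshoot on this data, which is why we lowered it; we will examine this in more detail later.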
Next, we will train the decision tree model simply using the default arguments to the trainRegressor method (which equates to using a tree depth of 5). Note that we need to pass in the other form of the dataset, data_dt, that we created from the raw feature values (as opposed to the binary-encoded features that we used for the preceding linear model).
We also need to pass in an argument for categoricalFeaturesInfo. This is a dictionary that maps the categorical feature index to the number of categories for the feature. If a feature is not in this mapping, it will be treated as continuous. For our purposes, we will leave this as is, passing in an empty mapping:
from pyspark.mllib.tree import DecisionTree

# Train a decision tree regressor with default arguments and an empty mapping
dt_model = DecisionTree.trainRegressor(data_dt, {})
# Zip the true labels with the tree's predictions on the feature vectors
preds = dt_model.predict(data_dt.map(lambda p: p.features))
actual = data.map(lambda p: p.label)
true_vs_predicted_dt = actual.zip(preds)
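For illustration, if data_dt did contain categorical features, the mapping would look something like the following sketch; the feature indices and category counts here are hypothetical and not the actual bike sharing schema:

# Hypothetical mapping: feature 0 has 4 categories, feature 1 has 12
# (illustrative indices and counts only, not the real dataset mapping)
cat_features_info = {0: 4, 1: 12}
dt_model_cat = DecisionTree.trainRegressor(data_dt, cat_features_info)

Any feature listed in the dictionary is then split on its category values, while all remaining features are treated as continuous, exactly as described above.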