Building a Regression Model with Spark - Machine Learning with Spark

data_log = data.map(lambda lp:
    LabeledPoint(np.log(lp.label), lp.features))
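To see what this transformation does to the targets without needing Spark, here is a minimal sketch in plain NumPy. The `labels` values are made up for illustration and are not taken from the dataset used in this chapter. Note that `np.log` is only finite for strictly positive targets.

```python
import numpy as np

# Hypothetical, right-skewed target values (illustration only).
labels = np.array([1.0, 10.0, 100.0, 1000.0])

# The same transformation that the map above applies to each lp.label.
log_labels = np.log(labels)

# Compare the spread of the targets before and after the transformation.
raw_range = labels.max() - labels.min()
log_range = log_labels.max() - log_labels.min()

print(raw_range, log_range)
```

The raw targets span a range of 999, while the log targets span only log(1000), which is roughly 6.9, so large values no longer dominate the squared-error loss to the same degree.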

We will then train a model on this transformed data and form the RDD of predicted versus true values:

model_log = LinearRegressionWithSGD.train(data_log,
    iterations=10, step=0.1)

Because we have transformed the target variable, the model's predictions are on the log scale, as are the target values of the transformed dataset. Therefore, to use the model and evaluate its performance, we must first transform both the predicted and the true values back to the original scale by applying the exponential function, using numpy's exp function. The following code does this:

true_vs_predicted_log = data_log.map(lambda p:
    (np.exp(p.label), np.exp(model_log.predict(p.features))))
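As a quick check of this back-transformation, the sketch below uses plain NumPy and made-up numbers (not values from the model above). It shows that `np.exp` undoes `np.log`, and that an additive error on the log scale becomes a multiplicative error on the original scale.

```python
import numpy as np

# Hypothetical true target and a log-scale prediction that is off by +0.1.
true_value = 200.0
log_prediction = np.log(true_value) + 0.1

# Exponentiating brings both values back to the original scale.
recovered_true = np.exp(np.log(true_value))
prediction = np.exp(log_prediction)

# The additive log-scale error of 0.1 shows up as a factor of exp(0.1).
ratio = prediction / true_value

print(recovered_true, prediction, ratio)
```

This is one reason why metrics computed on the original scale, such as MSE, can look quite different for a model trained on log targets: a small log-scale error on a large target is a large absolute error.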

Finally, we will compute the MSE, MAE, and RMSLE metrics for the model:

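# Each element tp is a (true, predicted) pair. Indexing is used instead of
# tuple-unpacking lambda arguments so the code also parses on Python 3.
mse_log = true_vs_predicted_log.map(lambda tp:
    squared_error(tp[0], tp[1])).mean()
mae_log = true_vs_predicted_log.map(lambda tp:
    abs_error(tp[0], tp[1])).mean()
rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda tp:
    squared_log_error(tp[0], tp[1])).mean())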
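The helper functions `squared_error`, `abs_error`, and `squared_log_error` are not defined in this section. The following is a minimal sketch of what they might look like, assuming the usual definitions of these metrics (with 1 added inside the logarithms for the log error). It is applied to a small hand-made list of (true, predicted) pairs rather than to an RDD, and the `pairs` values are hypothetical.

```python
import numpy as np

# Assumed definitions; the originals are defined outside this section.
def squared_error(actual, pred):
    return (pred - actual) ** 2

def abs_error(actual, pred):
    return np.abs(pred - actual)

def squared_log_error(actual, pred):
    # log(x + 1) keeps the metric finite when a value is zero.
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

# Made-up (true, predicted) pairs standing in for true_vs_predicted_log.
pairs = [(10.0, 12.0), (100.0, 90.0), (50.0, 50.0)]

mse = np.mean([squared_error(t, p) for t, p in pairs])
mae = np.mean([abs_error(t, p) for t, p in pairs])
rmsle = np.sqrt(np.mean([squared_log_error(t, p) for t, p in pairs]))

print("MSE: %.4f, MAE: %.4f, RMSLE: %.4f" % (mse, mae, rmsle))
```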

print("Mean Squared Error: %2.4f" % mse_log)
print("Mean Absolute Error: %2.4f" % mae_log)
print("Root Mean Squared Log Error: %2.4f" % rmsle_log)
print("Non log-transformed predictions:\n" +
    str(true_vs_predicted.take(3)))
print("Log-transformed predictions:\n" +
    str(true_vs_predicted_log.take(3)))

You should see output similar to the following:

Mean Squared Error: 38606.0875
Mean Absolute Error: 135.2726
Root Mean Squared Log Error: 1.3516

Non log-transformed predictions:
