data_log = data.map(lambda lp:
    LabeledPoint(np.log(lp.label), lp.features))
We will then train a model on this transformed data and form the RDD of predicted versus
true values:
model_log = LinearRegressionWithSGD.train(data_log,
    iterations=10, step=0.1)
Because we have transformed the target variable, the model's predictions are on the log
scale, as are the target values of the transformed dataset. To evaluate the model against
the original data, we must therefore transform both back to the original scale by applying
the exponential function (the numpy exp function) to the predicted and true values, as
shown in the following code:
true_vs_predicted_log = data_log.map(lambda p:
    (np.exp(p.label), np.exp(model_log.predict(p.features))))
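The back-transformation can be checked without Spark. The following is a minimal sketch, assuming plain NumPy and a small made-up array of targets (the values and variable names are illustrative and do not come from the dataset used here). It shows that np.exp undoes np.log, and that a prediction made on the log scale only becomes comparable to the original targets after it is exponentiated.

```python
import numpy as np

# Hypothetical target values on the original scale (illustrative only).
targets = np.array([16.0, 40.0, 32.0, 13.0, 1.0])

# Forward transform, as applied to each label when building data_log.
log_targets = np.log(targets)

# np.exp inverts np.log, so the original targets are recovered
# (up to floating-point rounding).
recovered = np.exp(log_targets)

# A stand-in "model" that always predicts the mean of the log targets.
# Its output lives on the log scale ...
log_prediction = log_targets.mean()
# ... and must be exponentiated before it is compared with the true values.
prediction = np.exp(log_prediction)
```

In this sketch, prediction is the geometric mean of the targets, which by the AM-GM inequality is never larger than their arithmetic mean.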
Finally, we will compute the MSE, MAE, and RMSLE metrics for the model:
# Each element is a (true, predicted) pair; indexing works in Python 2 and 3.
mse_log = true_vs_predicted_log.map(lambda tp:
    squared_error(tp[0], tp[1])).mean()
mae_log = true_vs_predicted_log.map(lambda tp:
    abs_error(tp[0], tp[1])).mean()
rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda tp:
    squared_log_error(tp[0], tp[1])).mean())
print("Mean Squared Error: %2.4f" % mse_log)
print("Mean Absolute Error: %2.4f" % mae_log)
print("Root Mean Squared Log Error: %2.4f" % rmsle_log)
print("Non log-transformed predictions:\n" +
    str(true_vs_predicted.take(3)))
print("Log-transformed predictions:\n" +
    str(true_vs_predicted_log.take(3)))
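The helper functions squared_error, abs_error, and squared_log_error are used in the metric computations but are not defined in this section. The following is a minimal sketch of what they might look like, assuming the usual definitions of these per-example errors. It applies them to a plain Python list rather than an RDD, and the sample pairs are made up for illustration.

```python
import numpy as np

def squared_error(actual, pred):
    # Squared difference; its mean over all examples is the MSE.
    return (pred - actual) ** 2

def abs_error(actual, pred):
    # Absolute difference; its mean over all examples is the MAE.
    return np.abs(pred - actual)

def squared_log_error(actual, pred):
    # Squared difference of log(1 + value); the square root of its
    # mean over all examples is the RMSLE.
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

# Hypothetical (true, predicted) pairs on the original scale.
pairs = [(16.0, 12.0), (40.0, 45.0), (1.0, 3.0)]

mse = np.mean([squared_error(t, p) for t, p in pairs])
mae = np.mean([abs_error(t, p) for t, p in pairs])
rmsle = np.sqrt(np.mean([squared_log_error(t, p) for t, p in pairs]))
```

Using log(1 + x) rather than log(x) keeps the squared log error defined when a true or predicted value is zero.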
The print statements above should produce output similar to the following:
Mean Squared Error: 38606.0875
Mean Absolute Error: 135.2726
Root Mean Squared Log Error: 1.3516
Non log-transformed predictions: