Building a Regression Model with Spark - Machine Learning with Spark

Computing performance metrics on the bike sharing dataset

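Using the error functions we defined earlier (squared_error, abs_error, and squared_log_error), we can now compute the evaluation metrics on our bike sharing data: mean squared error (MSE), mean absolute error (MAE), and root mean squared log error (RMSLE).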
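
For reference, the three metrics can be written out as below. This is a sketch using the standard definitions, not a restatement of the helper functions defined earlier, which may differ in detail. The symbols are assumptions introduced here: n is the number of records, t_i the true value and p_i the predicted value of record i, and the log error is assumed to add 1 to both values before taking the natural log so that zero counts are handled.

```latex
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (p_i - t_i)^2
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert p_i - t_i \rvert
\]
\[
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}
  \bigl(\log(p_i + 1) - \log(t_i + 1)\bigr)^2}
\]
```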

Linear model

Our approach is to apply the relevant error function to each record in the RDD of (true, predicted) pairs we computed earlier, which is true_vs_predicted for our linear model, and then take the mean of the results:

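import numpy as np

# Each record is a (true, predicted) pair. Indexing into the tuple keeps the
# lambdas valid on both Python 2 and Python 3.
mse = true_vs_predicted.map(lambda tp: squared_error(tp[0], tp[1])).mean()
mae = true_vs_predicted.map(lambda tp: abs_error(tp[0], tp[1])).mean()
# RMSLE is the square root of the mean squared log error.
rmsle = np.sqrt(true_vs_predicted.map(lambda tp: squared_log_error(tp[0], tp[1])).mean())

print("Linear Model - Mean Squared Error: %2.4f" % mse)
print("Linear Model - Mean Absolute Error: %2.4f" % mae)
print("Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle)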

This outputs the following metrics:

Linear Model - Mean Squared Error: 28166.3824

Linear Model - Mean Absolute Error: 129.4506

Linear Model - Root Mean Squared Log Error: 1.4974
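
To make the calculation easier to follow without a Spark cluster, the same three metrics can be sketched in plain Python on a small list of (true, predicted) pairs. The function bodies are assumptions based on the standard definitions of each error, since the helpers defined earlier are not shown here. The names mean, regression_metrics, and toy_pairs are hypothetical, the toy values are made up for illustration, and math.sqrt stands in for np.sqrt to keep the sketch dependency-free.

```python
import math


def squared_error(actual, pred):
    # Assumed definition: squared difference between prediction and truth.
    return (pred - actual) ** 2


def abs_error(actual, pred):
    # Assumed definition: absolute difference between prediction and truth.
    return abs(pred - actual)


def squared_log_error(actual, pred):
    # Assumed definition: squared difference of log(x + 1), so zero counts are safe.
    return (math.log(pred + 1) - math.log(actual + 1)) ** 2


def mean(values):
    # Plain-Python stand-in for the RDD mean() call.
    values = list(values)
    return sum(values) / float(len(values))


def regression_metrics(pairs):
    # pairs is a list of (true, predicted) tuples, like the records in the RDD.
    mse = mean(squared_error(t, p) for t, p in pairs)
    mae = mean(abs_error(t, p) for t, p in pairs)
    rmsle = math.sqrt(mean(squared_log_error(t, p) for t, p in pairs))
    return mse, mae, rmsle


# Illustrative values only, not taken from the bike sharing dataset.
toy_pairs = [(3.0, 5.0), (10.0, 8.0), (0.0, 1.0)]
mse, mae, rmsle = regression_metrics(toy_pairs)
print("Toy data - Mean Squared Error: %2.4f" % mse)
print("Toy data - Mean Absolute Error: %2.4f" % mae)
print("Toy data - Root Mean Squared Log Error: %2.4f" % rmsle)
```

Because MSE squares each error, a few large misses dominate it, while RMSLE compares values on a log scale and so penalizes relative rather than absolute differences. This is one reason the three numbers reported above sit on such different scales.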

Decision tree

We will use the same approach for the decision tree model, using the true_vs_predicted_dt RDD:

# Same metrics as for the linear model, computed on the decision tree's
# (true, predicted) pairs.
mse_dt = true_vs_predicted_dt.map(lambda tp: squared_error(tp[0], tp[1])).mean()
mae_dt = true_vs_predicted_dt.map(lambda tp: abs_error(tp[0], tp[1])).mean()
rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda tp: squared_log_error(tp[0], tp[1])).mean())
