[(16.0, 119.30920003093594), (40.0, 116.95463511937378), (32.0, 116.57294610647752)]
Log-transformed predictions:
[(15.999999999999998, 45.860944832110015), (40.0, 43.255903592233274), (32.0, 42.311306147884252)]
Comparing these to the results on the raw target variable, we see that while the MSE and MAE did not improve, the RMSLE did.
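The error helper functions used throughout this section are defined earlier in the chapter. As a reminder, a minimal NumPy version consistent with how they are used here (the function names match the book's usage; the bodies are a sketch) can be checked against the first target/prediction pair shown above:

```python
import numpy as np

def squared_error(actual, pred):
    # Squared difference on the raw scale; averaged over all pairs this is the MSE
    return (pred - actual) ** 2

def abs_error(actual, pred):
    # Absolute difference; averaged over all pairs this is the MAE
    return np.abs(pred - actual)

def squared_log_error(actual, pred):
    # Squared difference of log(1 + x); averaged and square-rooted this is the RMSLE
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

# First (target, prediction) pair from the raw-target and log-trained models above
sle_raw = squared_log_error(16.0, 119.3092)
sle_log = squared_log_error(16.0, 45.8609)
print(sle_raw > sle_log)  # the log-trained model has the smaller log error
```

Evaluated on this pair, the log-trained model's prediction is closer on the log scale even though both overshoot the true count, which is exactly what the improved RMSLE reflects.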
We will perform the same analysis for the decision tree model:
data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))
dt_model_log = DecisionTree.trainRegressor(data_dt_log, {})

preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))
actual_log = data_dt_log.map(lambda p: p.label)
# Map predictions and targets back to the original scale before computing metrics
true_vs_predicted_dt_log = actual_log.zip(preds_log).map(
    lambda t_p: (np.exp(t_p[0]), np.exp(t_p[1])))

mse_log_dt = true_vs_predicted_dt_log.map(
    lambda t_p: squared_error(t_p[0], t_p[1])).mean()
mae_log_dt = true_vs_predicted_dt_log.map(
    lambda t_p: abs_error(t_p[0], t_p[1])).mean()
rmsle_log_dt = np.sqrt(true_vs_predicted_dt_log.map(
    lambda t_p: squared_log_error(t_p[0], t_p[1])).mean())

print("Mean Squared Error: %2.4f" % mse_log_dt)
print("Mean Absolute Error: %2.4f" % mae_log_dt)
print("Root Mean Squared Log Error: %2.4f" % rmsle_log_dt)
print("Non log-transformed predictions:\n" + str(true_vs_predicted_dt.take(3)))
print("Log-transformed predictions:\n" + str(true_vs_predicted_dt_log.take(3)))
From the results here, we can see that we actually made our metrics slightly worse for the
decision tree:
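One way to build intuition for why training on log-transformed targets can help RMSLE yet hurt MSE is to consider the simplest possible model, a constant predictor (an illustrative sketch, not from the chapter): fitting the mean in log space and exponentiating back yields the geometric mean of the targets, which is never larger than the arithmetic mean that minimizes MSE.

```python
import numpy as np

# The three example targets from the predictions shown above
targets = np.array([16.0, 40.0, 32.0])

# A constant model fit on the raw targets predicts the arithmetic mean,
# which is the MSE-optimal constant ...
arith_mean = targets.mean()

# ... while fitting the mean of the log targets and exponentiating back
# yields the geometric mean, which favours relative (log-scale) accuracy
geo_mean = np.exp(np.log(targets).mean())

print(arith_mean, geo_mean)  # the geometric mean is the smaller of the two
```

The log-trained constant sacrifices raw-scale squared error to get closer in log space; whether that trade pays off for a full model depends on the model and the data, which is why the linear model improved here while the decision tree did not.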