        squared_log_error(t, p)).mean())
    return rmsle
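The fragment above is the tail of the evaluation function, which computes the RMSLE (root mean squared log error) metric from a squared_log_error helper. For reference, a minimal pure-Python/NumPy sketch of these two pieces might look like this (the exact helper definitions are assumptions here, not shown in this excerpt):

```python
import numpy as np

def squared_log_error(actual, pred):
    # Squared difference of log-transformed values; the +1 guards against log(0).
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

def rmsle(actuals, preds):
    # Root mean squared log error over paired (actual, predicted) values.
    return np.sqrt(np.mean([squared_log_error(t, p)
                            for t, p in zip(actuals, preds)]))
```

In the book's Spark code, the same computation is applied to an RDD of (target, prediction) pairs via map and mean rather than Python lists.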
Tree depth
We would generally expect performance to increase with more complex trees (that is, trees of greater depth). A lower tree depth acts as a form of regularization, and, as with L1 or L2 regularization in linear models, there may be a tree depth that is optimal with respect to test set performance.
Here, we will try increasing the depth of the trees to see what impact it has on the test set RMSLE, keeping the number of bins at the default value of 32:
# Sweep over tree depths, keeping the number of bins fixed at 32
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [evaluate_dt(train_data_dt, test_data_dt, param, 32)
           for param in params]
print params
print metrics
plot(params, metrics)
fig = matplotlib.pyplot.gcf()
In this case, it appears that the decision tree starts over-fitting at deeper tree levels. The optimal tree depth appears to be around 10 on this dataset.
Note
Notice that our best RMSLE of 0.42 is now quite close to the Kaggle winner of around
0.29!
The output of the tree depth sweep is as follows:
[1, 2, 3, 4, 5, 10, 20]
[1.0280339660196287, 0.92686672078778276, 0.81807794023407532, 0.74060228537329209, 0.63583503599563096, 0.42851360418692447, 0.45500008049779139]
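Rather than reading the best depth off the plot, the parameter and metric lists can be paired up and the minimum-RMSLE entry picked out programmatically. A minimal sketch, using the values printed above (truncated here for readability):

```python
# Depths tried and the corresponding test set RMSLE values from the sweep above
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [1.0280, 0.9269, 0.8181, 0.7406, 0.6358, 0.4285, 0.4550]

# min over (metric, depth) pairs compares by metric first,
# so this selects the depth with the lowest RMSLE
best_rmsle, best_depth = min(zip(metrics, params))
```

Here best_depth comes out as 10, matching the conclusion drawn from the plot.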