        squared_log_error(t, p)).mean())
    return rmsle
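The fragment above is the tail of the evaluation function, which computes the RMSLE (root mean squared log error) metric from a squared_log_error helper. For reference, a minimal pure-Python/NumPy sketch of these two pieces might look like this (the exact helper definitions are assumptions here, not shown in this excerpt):

```python
import numpy as np

def squared_log_error(actual, pred):
    # Squared difference of log-transformed values; the +1 guards against log(0).
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

def rmsle(actuals, preds):
    # Root mean squared log error over paired (actual, predicted) values.
    return np.sqrt(np.mean([squared_log_error(t, p)
                            for t, p in zip(actuals, preds)]))
```

In the book's Spark code, the same computation is applied to an RDD of (target, prediction) pairs via map and mean rather than Python lists.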
Tree depth
We would generally expect performance to increase with more complex trees (that is, trees of greater depth). A lower tree depth acts as a form of regularization, and, as with L1 or L2 regularization in linear models, there may be a tree depth that is optimal with respect to test set performance.
Here, we will try increasing the depth of the trees to see what impact it has on the test set RMSLE, keeping the number of bins at the default value of 32:
# Sweep over tree depths, keeping the number of bins fixed at 32
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [evaluate_dt(train_data_dt, test_data_dt, param, 32)
           for param in params]
print params
print metrics
plot(params, metrics)
fig = matplotlib.pyplot.gcf()
In this case, it appears that the decision tree starts over-fitting at deeper tree levels. The optimal tree depth appears to be around 10 on this dataset.
Note
Notice that our best RMSLE of 0.42 is now quite close to the Kaggle winner of around
0.29!
The output of the tree depth sweep is as follows:
[1, 2, 3, 4, 5, 10, 20]
[1.0280339660196287, 0.92686672078778276, 0.81807794023407532, 0.74060228537329209, 0.63583503599563096, 0.42851360418692447, 0.45500008049779139]
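Rather than reading the best depth off the plot, the parameter and metric lists can be paired up and the minimum-RMSLE entry picked out programmatically. A minimal sketch, using the values printed above (truncated here for readability):

```python
# Depths tried and the corresponding test set RMSLE values from the sweep above
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [1.0280, 0.9269, 0.8181, 0.7406, 0.6358, 0.4285, 0.4550]

# min over (metric, depth) pairs compares by metric first,
# so this selects the depth with the lowest RMSLE
best_rmsle, best_depth = min(zip(metrics, params))
```

Here best_depth comes out as 10, matching the conclusion drawn from the plot.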