10 tree depth, AUC = 76.26%
20 tree depth, AUC = 98.45%
Next, we will perform the same computation using the Gini impurity measure (we omitted the code as it is very similar, but it can be found in the code bundle). Your results should look something like this:
1 tree depth, AUC = 59.33%
2 tree depth, AUC = 61.68%
3 tree depth, AUC = 62.61%
4 tree depth, AUC = 63.63%
5 tree depth, AUC = 64.89%
10 tree depth, AUC = 78.37%
20 tree depth, AUC = 98.87%
As you can see from the preceding results, increasing the tree depth parameter results in a more accurate model (as expected, since the model is allowed to become more complex with greater tree depth). However, it is very likely that at higher tree depths the model will overfit the dataset significantly.
There is very little difference in performance between the two impurity measures.
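For reference, the omitted Gini run might look something like the following sketch. It assumes the `data` RDD of `LabeledPoint` instances and the `trainDTWithParams` convenience function defined earlier in the chapter; the exact code is in the book's code bundle:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.tree.impurity.Gini

// Assumes `data: RDD[LabeledPoint]` and the `trainDTWithParams`
// helper from the entropy example, with the impurity swapped to Gini.
val dtResultsGini = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = trainDTWithParams(data, param, Gini)
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth", metrics.areaUnderROC)
}
dtResultsGini.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
```

The only change from the entropy version is the `Gini` impurity object passed to the training function.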
The naïve Bayes model
Finally, let's see the impact of changing the lambda parameter for naïve Bayes. This parameter controls additive smoothing, which handles the case when a class and feature value do not occur together in the dataset.
Tip
See http://en.wikipedia.org/wiki/Additive_smoothing for more details on additive smoothing.
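To make the effect of smoothing concrete, here is a small illustrative calculation (not from the book): with additive smoothing, the estimate of a feature value's probability given a class becomes (count + lambda) / (classCount + lambda * numFeatureValues), so a combination never seen in the training data gets a small positive probability instead of zero:

```scala
// Illustrative only: additive (Laplace) smoothing of a count-based estimate.
def smoothedProb(featureClassCount: Long, classCount: Long,
                 numFeatureValues: Int, lambda: Double): Double =
  (featureClassCount + lambda) / (classCount + lambda * numFeatureValues)

// Unseen combination: count = 0, class seen 100 times, 2 feature values.
// With lambda = 1.0 this gives 1 / 102 ≈ 0.0098 instead of 0.0.
val smoothed = smoothedProb(0L, 100L, 2, 1.0)
// With lambda = 0.0 (no smoothing) the estimate collapses to exactly 0.0.
val unsmoothed = smoothedProb(0L, 100L, 2, 0.0)
```

A zero probability for any single feature value would zero out the whole product of conditional probabilities in naïve Bayes, which is why the smoothing matters.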
We will take the same approach as we did earlier, first creating a convenience training function and training the model with varying levels of lambda:
def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
val nb = new NaiveBayes
nb.setLambda(lambda)