10 tree depth, AUC = 76.26%
20 tree depth, AUC = 98.45%
Next, we will perform the same computation using the Gini impurity measure (we omitted the code as it is very similar, but it can be found in the code bundle). Your results should look something like this:
1 tree depth, AUC = 59.33%
2 tree depth, AUC = 61.68%
3 tree depth, AUC = 62.61%
4 tree depth, AUC = 63.63%
5 tree depth, AUC = 64.89%
10 tree depth, AUC = 78.37%
20 tree depth, AUC = 98.87%
As you can see from the preceding results, increasing the tree depth parameter results in a more accurate model (as expected, since the model is allowed to become more complex with greater tree depth). However, it is very likely that at higher tree depths the model will overfit the dataset significantly.
There is very little difference in performance between the two impurity measures.
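For reference, the omitted Gini run might look something like the following sketch. It assumes the `data` RDD of `LabeledPoint` instances and the `trainDTWithParams` convenience function defined earlier in the chapter; the exact code is in the book's code bundle:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.tree.impurity.Gini

// Assumes `data: RDD[LabeledPoint]` and the `trainDTWithParams`
// helper from the entropy example, with the impurity swapped to Gini.
val dtResultsGini = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = trainDTWithParams(data, param, Gini)
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth", metrics.areaUnderROC)
}
dtResultsGini.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
```

The only change from the entropy version is the `Gini` impurity object passed to the training function.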
The naïve Bayes model
Finally, let's see the impact of changing the lambda parameter for naïve Bayes. This parameter controls additive smoothing, which handles the case when a class and feature value do not occur together in the dataset.
Tip
See http://en.wikipedia.org/wiki/Additive_smoothing for more details on additive smoothing.
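To make the effect of smoothing concrete, here is a small illustrative calculation (not from the book): with additive smoothing, the estimate of a feature value's probability given a class becomes (count + lambda) / (classCount + lambda * numFeatureValues), so a combination never seen in the training data gets a small positive probability instead of zero:

```scala
// Illustrative only: additive (Laplace) smoothing of a count-based estimate.
def smoothedProb(featureClassCount: Long, classCount: Long,
                 numFeatureValues: Int, lambda: Double): Double =
  (featureClassCount + lambda) / (classCount + lambda * numFeatureValues)

// Unseen combination: count = 0, class seen 100 times, 2 feature values.
// With lambda = 1.0 this gives 1 / 102 ≈ 0.0098 instead of 0.0.
val smoothed = smoothedProb(0L, 100L, 2, 1.0)
// With lambda = 0.0 (no smoothing) the estimate collapses to exactly 0.0.
val unsmoothed = smoothedProb(0L, 100L, 2, 0.0)
```

A zero probability for any single feature value would zero out the whole product of conditional probabilities in naïve Bayes, which is why the smoothing matters.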
We will take the same approach as we did earlier, first creating a convenience training function and training the model with varying levels of lambda:
def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
val nb = new NaiveBayes
nb.setLambda(lambda)