regResults.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
Your output should look like this:
0.001 L2 regularization parameter, AUC = 66.55%
0.01 L2 regularization parameter, AUC = 66.55%
0.1 L2 regularization parameter, AUC = 66.63%
1.0 L2 regularization parameter, AUC = 66.04%
10.0 L2 regularization parameter, AUC = 35.33%
As we can see, at low levels of regularization there is little impact on model performance. However, as we increase regularization, we can see the impact of under-fitting on our model evaluation: the heavily regularized model is too simple to capture the structure in the data, and the AUC drops sharply.
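For reference, the results above can be produced with a loop over candidate regularization parameters. The sketch below assumes the helper functions trainWithParams and createMetrics, the scaled training dataset scaledDataCats, and the numIterations setting from earlier in the chapter; if your names differ, substitute your own:

import org.apache.spark.mllib.optimization.SquaredL2Updater

// Train a model for each L2 regularization parameter and compute its AUC
val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainWithParams(scaledDataCats, param, numIterations,
    new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", scaledDataCats, model)
}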
Tip
You will find similar results when using L1 regularization. Give it a try by performing the same evaluation of the regularization parameter against the AUC measure for L1Updater.
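A minimal sketch of the L1 experiment suggested in the tip, again assuming the trainWithParams and createMetrics helpers and the scaledDataCats dataset from earlier (only the updater changes relative to the L2 run):

import org.apache.spark.mllib.optimization.L1Updater

// Same evaluation as before, swapping in L1 regularization
val l1Results = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainWithParams(scaledDataCats, param, numIterations,
    new L1Updater, 1.0)
  createMetrics(s"$param L1 regularization parameter", scaledDataCats, model)
}
l1Results.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}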
Decision trees
The decision tree model we trained earlier was the best performer on the raw data that we first used. We set a parameter called maxDepth, which controls the maximum depth of the tree and, thus, the complexity of the model. Deeper trees result in more complex models that will be able to fit the data better.
For classification problems, we can also select between two measures of impurity: Gini and Entropy.
Tuning tree depth and impurity
We will illustrate the impact of tree depth in a similar manner as we did for our logistic regression model.
First, we will need to create another helper function in the Spark shell:
import org.apache.spark.mllib.tree.impurity.Impurity
import org.apache.spark.mllib.tree.impurity.Entropy
import org.apache.spark.mllib.tree.impurity.Gini
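The helper itself might look like the sketch below, which wraps MLlib's DecisionTree.train so that maxDepth and the impurity measure can be varied in one place. The function name trainDTWithParams is our own choice, and the input RDD is assumed to be the LabeledPoint dataset used earlier in the chapter:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train a classification tree with the given depth and impurity measure
def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int,
    impurity: Impurity) = {
  DecisionTree.train(input, Algo.Classification, impurity, maxDepth)
}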