def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int, impurity: Impurity) = {
  DecisionTree.train(input, Algo.Classification, impurity, maxDepth)
}
Now, we're ready to compute our AUC metric for different settings of tree depth. We will simply use our original dataset in this example, since we do not need the data to be standardized.
Tip
Note that decision tree models generally do not require features to be standardized or normalized, nor do they require categorical features to be binary-encoded.
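The reason is that tree splits simply compare a feature against a threshold, so any monotonic rescaling of a feature rescales the learned threshold without changing which side each point falls on. The following is a minimal sketch in plain Python (a hypothetical one-level decision stump with exhaustive threshold search, not Spark's implementation) illustrating this invariance:

```python
def fit_stump(xs, ys):
    """Fit a one-level decision tree (stump) on a single feature by
    exhaustive midpoint-threshold search, minimizing misclassifications."""
    order = sorted(xs)
    best = None  # (errors, threshold, left_label, right_label)
    for i in range(len(order) - 1):
        t = (order[i] + order[i + 1]) / 2.0  # candidate split: midpoint
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # majority-vote label on each side of the split
        l_lab = 1 if sum(left) * 2 > len(left) else 0
        r_lab = 1 if sum(right) * 2 > len(right) else 0
        errors = sum(1 for x, y in zip(xs, ys)
                     if (l_lab if x <= t else r_lab) != y)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return best

def predict(stump, x):
    _, t, l_lab, r_lab = stump
    return l_lab if x <= t else r_lab

# Made-up toy data: one feature, two well-separated classes
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
raw = fit_stump(xs, ys)
scaled = fit_stump([x * 1000 for x in xs], ys)  # same feature, rescaled
# Both stumps make identical predictions on corresponding inputs:
same = all(predict(raw, x) == predict(scaled, x * 1000) for x in xs)
print(same)  # True: the threshold scales, the predictions do not
```

The raw stump learns a threshold of 6.5, the rescaled one 6500.0, but every prediction is identical; this is why standardization adds nothing for tree models.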
First, train the model using the Entropy impurity measure and varying tree depths:
val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = trainDTWithParams(data, param, Entropy)
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth", metrics.areaUnderROC)
}
dtResultsEntropy.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
This should output the results shown here:
1 tree depth, AUC = 59.33%
2 tree depth, AUC = 61.68%
3 tree depth, AUC = 62.61%
4 tree depth, AUC = 63.63%
5 tree depth, AUC = 64.88%
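The areaUnderROC values above can be understood through AUC's rank interpretation: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch in plain Python (not Spark's BinaryClassificationMetrics implementation; the score-label pairs are made up for illustration):

```python
def auc(score_and_labels):
    """Area under the ROC curve, computed from its probabilistic
    definition: the chance that a random positive outranks a random
    negative (ties count one half)."""
    pos = [s for s, l in score_and_labels if l == 1.0]
    neg = [s for s, l in score_and_labels if l == 0.0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical (score, label) pairs
pairs = [(0.9, 1.0), (0.8, 0.0), (0.4, 1.0), (0.3, 0.0)]
print(auc(pairs))  # 0.75: 3 of the 4 positive-negative pairs are ranked correctly
```

Note that thresholding the model's score to 0.0 or 1.0 before computing the metric, as the Scala code above does, collapses the ranking to two values, so the resulting AUC reflects a single operating point rather than the full ROC curve.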