We can verify this by subtracting the mean from the first feature and dividing the result by the square root of the variance (which we computed earlier):
// standardize the first feature value by hand: (value - mean) / sqrt(variance)
println((0.789131 - 0.41225805299526636) / math.sqrt(0.1097424416755897))
The result should be equal to the first element of our scaled vector:
1.137647336497682
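For context, the scaled dataset used here would typically be produced with MLlib's StandardScaler, fitted on the raw feature vectors and then applied to each record. The following is a minimal sketch, assuming data is an RDD[LabeledPoint] from earlier in the chapter; the variable names vectors and scaler are illustrative rather than taken from the original text:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// fit the scaler on the raw feature vectors
// (withMean subtracts the column mean, withStd divides by the standard deviation)
val vectors = data.map(lp => lp.features)
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)

// apply the same transformation to every record, keeping the labels unchanged
val scaledData = data.map(lp =>
  LabeledPoint(lp.label, scaler.transform(lp.features)))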
We can now retrain our model using the standardized data. We will use only the logistic
regression model to illustrate the impact of feature standardization (since the decision tree
and naïve Bayes are not impacted by this):
val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numIterations)
val lrTotalCorrectScaled = scaledData.map { point =>
  if (lrModelScaled.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaled = lrTotalCorrectScaled / numData
val lrPredictionsVsTrue = scaledData.map { point =>
  (lrModelScaled.predict(point.features), point.label)
}
val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
val lrPr = lrMetricsScaled.areaUnderPR
val lrRoc = lrMetricsScaled.areaUnderROC
println(f"${lrModelScaled.getClass.getSimpleName}\nAccuracy: ${lrAccuracyScaled * 100}%2.4f%%\nArea under PR: ${lrPr * 100.0}%2.4f%%\nArea under ROC: ${lrRoc * 100.0}%2.4f%%")
The result should look similar to this:
LogisticRegressionModel
Accuracy: 62.0419%
Area under PR: 72.7254%
Area under ROC: 61.9663%
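To make the impact of standardization explicit, these metrics can be compared against those from the unscaled logistic regression run earlier. A minimal sketch, assuming the earlier results were kept in variables named lrAccuracy, lrPrOrig, and lrRocOrig (these names are chosen here for illustration, not taken from the original code):
// assumed: lrAccuracy, lrPrOrig and lrRocOrig hold the accuracy, area under PR
// and area under ROC computed for the unscaled logistic regression model
println(f"Change in accuracy:       ${(lrAccuracyScaled - lrAccuracy) * 100}%2.4f%%")
println(f"Change in area under PR:  ${(lrPr - lrPrOrig) * 100.0}%2.4f%%")
println(f"Change in area under ROC: ${(lrRoc - lrRocOrig) * 100.0}%2.4f%%")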