You will see the following on the screen:
[-0.023261105535492967,2.720728254208072,-0.4464200056407091,-0.2205258360869135,
...
Tip
Note that while the original raw features were sparse (that is, most entries are zero), subtracting the mean from each entry produces a non-sparse (dense) representation, as can be seen in the preceding example.
This is not a problem here, as the data size is small, but large-scale real-world problems often have extremely sparse input data with many features (online advertising and text classification are good examples). In such cases, it is not advisable to lose this sparsity, as the memory and processing requirements for the equivalent dense representation can quickly explode with many millions of features. We can avoid this by using StandardScaler with withMean set to false, as sketched below.
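The following is a minimal sketch of such sparsity-preserving scaling. It assumes an existing RDD[LabeledPoint] of raw sparse features; the names data, sparseScaler, and scaledDataSparse are illustrative, not part of the preceding example:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// withMean = false skips mean subtraction (which would densify the
// vectors), while withStd = true still scales each feature to unit
// standard deviation, so sparse vectors stay sparse.
// `data` is assumed to be an RDD[LabeledPoint] of raw sparse features.
val sparseScaler = new StandardScaler(withMean = false, withStd = true)
  .fit(data.map(lp => lp.features))
val scaledDataSparse = data.map(lp =>
  LabeledPoint(lp.label, sparseScaler.transform(lp.features)))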
We're now ready to train a new logistic regression model with our expanded feature set and then evaluate its performance:
// train a logistic regression model on the scaled data with the category feature added
val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCats, numIterations)
// count correct predictions to compute overall accuracy
val lrTotalCorrectScaledCats = scaledDataCats.map { point =>
  if (lrModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaledCats = lrTotalCorrectScaledCats / numData
// pair each prediction with the true label for the evaluation metrics
val lrPredictionsVsTrueCats = scaledDataCats.map { point =>
  (lrModelScaledCats.predict(point.features), point.label)
}
val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueCats)
val lrPrCats = lrMetricsScaledCats.areaUnderPR
val lrRocCats = lrMetricsScaledCats.areaUnderROC
println(f"${lrModelScaledCats.getClass.getSimpleName}\nAccuracy: ${lrAccuracyScaledCats * 100}%2.4f%%\nArea under PR: ${lrPrCats * 100.0}%2.4f%%\nArea under ROC: ${lrRocCats * 100.0}%2.4f%%")