Tip
It is important that we use the training set's IDF to transform the test data, as this creates a more realistic estimate of model performance on new data, which might contain terms that the model has not yet been trained on. It would be "cheating" to recompute the IDF vector based on the test dataset and, more importantly, doing so could lead to incorrect estimates of the optimal model parameters selected through cross-validation.
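As a brief sketch of this pattern, the following code fits the IDF model on the training term frequencies only and then reuses that fitted model to transform both splits. The tokenized trainTokens and testTokens RDDs here are hypothetical placeholders (small toy documents) rather than the actual dataset, and a SparkContext named sc is assumed:
import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
// Hypothetical tokenized documents standing in for the real train/test splits
val trainTokens: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "is", "fast"), Seq("hadoop", "does", "map", "reduce")))
val testTokens: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "streaming", "is", "fast")))
// Hash the terms into sparse term-frequency vectors
val hashingTF = new HashingTF()
val trainTf: RDD[Vector] = hashingTF.transform(trainTokens)
val testTf: RDD[Vector] = hashingTF.transform(testTokens)
// Fit the IDF weights on the training term frequencies only ...
val idf: IDFModel = new IDF().fit(trainTf)
// ... and apply those same weights to both the training and test vectors
val trainTfIdf = idf.transform(trainTf)
val testTfIdf = idf.transform(testTf)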
Now, we're ready to compute the predictions and true class labels for our model. We will use the resulting RDD to compute the accuracy and the multiclass weighted F-measure:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
// Pair each test document's predicted label with its true label
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
// Accuracy: the fraction of test documents where the prediction matches the true label
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
val metrics = new MulticlassMetrics(predictionAndLabel)
println(accuracy)
println(metrics.weightedFMeasure)
Tip
The weighted F-measure is an overall measure of precision and recall performance (where, like the area under an ROC curve, values closer to 1.0 indicate better performance). It is computed by averaging the per-class F-measures, weighted by the proportion of instances in each class.
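As a rough illustration, and continuing with the predictionAndLabel and metrics values from the code above, the weighted value can be recomputed by hand by averaging the per-class F-measures with weights given by each class's share of the true test labels (this is a sketch of the idea rather than MLlib's internal implementation):
// Weight each class's F-measure by its share of the true labels in the test set
val total = predictionAndLabel.count().toDouble
val manualWeightedF = metrics.labels.map { label =>
  val classWeight = predictionAndLabel.filter(_._2 == label).count() / total
  classWeight * metrics.fMeasure(label)
}.sum
// This should closely match metrics.weightedFMeasure
println(manualWeightedF)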
We can see that our simple multiclass naïve Bayes model has achieved close to 80 percent
for both accuracy and F-measure:
0.7915560276155071
0.7810675969031116