Tip
It is important that we use the training set's IDF to transform the test data, as this creates a more realistic estimate of model performance on new data, which might contain terms that the model has not yet been trained on. It would be "cheating" to recompute the IDF vector based on the test dataset and, more importantly, doing so could lead to incorrect estimates of the optimal model parameters selected through cross-validation.
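As a brief sketch of this pattern, the following code fits the IDF model on the training term frequencies only and then reuses that fitted model to transform both splits. The tokenized trainTokens and testTokens RDDs here are hypothetical placeholders (small toy documents) rather than the actual dataset, and a SparkContext named sc is assumed:
import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
// Hypothetical tokenized documents standing in for the real train/test splits
val trainTokens: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "is", "fast"), Seq("hadoop", "does", "map", "reduce")))
val testTokens: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "streaming", "is", "fast")))
// Hash the terms into sparse term-frequency vectors
val hashingTF = new HashingTF()
val trainTf: RDD[Vector] = hashingTF.transform(trainTokens)
val testTf: RDD[Vector] = hashingTF.transform(testTokens)
// Fit the IDF weights on the training term frequencies only ...
val idf: IDFModel = new IDF().fit(trainTf)
// ... and apply those same weights to both the training and test vectors
val trainTfIdf = idf.transform(trainTf)
val testTfIdf = idf.transform(testTf)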
Now, we're ready to compute the predictions and true class labels for our model. We will use the resulting RDD to compute the accuracy and the multiclass weighted F-measure:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
// Pair each test document's predicted label with its true label
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
// Accuracy: the fraction of test documents where the prediction matches the true label
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
val metrics = new MulticlassMetrics(predictionAndLabel)
println(accuracy)
println(metrics.weightedFMeasure)
Tip
The weighted F-measure is an overall measure of precision and recall performance (where, like the area under an ROC curve, values closer to 1.0 indicate better performance). It is computed by averaging the per-class F-measures, weighted by the proportion of instances in each class.
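As a rough illustration, and continuing with the predictionAndLabel and metrics values from the code above, the weighted value can be recomputed by hand by averaging the per-class F-measures with weights given by each class's share of the true test labels (this is a sketch of the idea rather than MLlib's internal implementation):
// Weight each class's F-measure by its share of the true labels in the test set
val total = predictionAndLabel.count().toDouble
val manualWeightedF = metrics.labels.map { label =>
  val classWeight = predictionAndLabel.filter(_._2 == label).count() / total
  classWeight * metrics.fMeasure(label)
}.sum
// This should closely match metrics.weightedFMeasure
println(manualWeightedF)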
We can see that our simple multiclass naïve Bayes model has achieved close to 80 percent
for both accuracy and F-measure:
0.7915560276155071
0.7810675969031116