Using the correct form of data
Another critical aspect of model performance is using the correct form of data for each
model. Previously, we saw that applying a naïve Bayes model to our numerical features
resulted in very poor performance. Is this because the model itself is deficient?
In this case, recall that MLlib implements a multinomial model. This model works on input
in the form of non-negative count data. This can include a binary representation of categorical
features (such as the 1-of-k encoding covered previously) or frequency data (such as the
frequency of occurrences of words in a document). The numerical features we used initially
do not conform to this assumed input distribution, so it is probably unsurprising that the
model performed so poorly.
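For example, raw word counts for a document are a valid input for the multinomial model, since every value is a non-negative count. The following is a minimal illustrative sketch (the vocabulary and document here are made up and not part of the chapter's pipeline):
// Count how often each vocabulary word occurs in a toy document;
// the resulting vector contains only non-negative counts.
val vocabulary = Seq("spark", "mllib", "bayes")
val doc = Seq("spark", "spark", "bayes")
val wordCounts = vocabulary.map(w => doc.count(_ == w).toDouble).toArray
val docVector = Vectors.dense(wordCounts)   // [2.0, 0.0, 1.0]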
To demonstrate this with our dataset, we'll use only the category feature, which, when 1-of-k
encoded, is of the correct form for the model. We will create a new dataset as follows:
val dataNB = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  // 1-of-k encode the category field (column 3) as a binary vector
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  LabeledPoint(label, Vectors.dense(categoryFeatures))
}
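This snippet assumes that the categories lookup map and numCategories count created earlier in the chapter are still in scope. If they are not, they can be rebuilt from the raw records along these lines (a minimal sketch, using the category field at column index 3 as before):
// Map each distinct category name to an index for 1-of-k encoding
val categories = records.map(r => r(3)).distinct.collect.zipWithIndex.toMap
val numCategories = categories.size
As a quick sanity check, inspecting a single transformed record (for example, with dataNB.first) should show a vector in which exactly one element is set to 1.0.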
Next, we will train a new naïve Bayes model and evaluate its performance:
// Train naïve Bayes on the 1-of-k encoded category feature only
val nbModelCats = NaiveBayes.train(dataNB)
// Compute accuracy: the fraction of records predicted correctly
val nbTotalCorrectCats = dataNB.map { point =>
  if (nbModelCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbAccuracyCats = nbTotalCorrectCats / numData
// Compute the area under the precision-recall curve
val nbPredictionsVsTrueCats = dataNB.map { point =>
  (nbModelCats.predict(point.features), point.label)
}
val nbMetricsCats = new BinaryClassificationMetrics(nbPredictionsVsTrueCats)
val nbPrCats = nbMetricsCats.areaUnderPR
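To compare these figures against the earlier models, we can print them out. The following is a minimal sketch of how this might look (the exact values will depend on the dataset):
println(f"Naive Bayes (1-of-k categories) accuracy: ${nbAccuracyCats * 100}%2.4f%%")
println(f"Naive Bayes (1-of-k categories) area under PR: ${nbPrCats * 100.0}%2.4f%%")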