Using the correct form of data
Another critical aspect of model performance is using the correct form of data for each
model. Previously, we saw that applying a naïve Bayes model to our numerical features
resulted in very poor performance. Is this because the model itself is deficient?
In this case, recall that MLlib implements a multinomial model. This model works on input
in the form of non-negative count data. This can include a binary representation of categorical
features (such as the 1-of-k encoding covered previously) or frequency data (such as the
frequency of occurrences of words in a document). The numerical features we used initially
do not conform to this assumed input distribution, so it is probably unsurprising that the
model performed so poorly.
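For example, raw word counts for a document are a valid input for the multinomial model, since every value is a non-negative count. The following is a minimal illustrative sketch (the vocabulary and document here are made up and not part of the chapter's pipeline):
// Count how often each vocabulary word occurs in a toy document;
// the resulting vector contains only non-negative counts.
val vocabulary = Seq("spark", "mllib", "bayes")
val doc = Seq("spark", "spark", "bayes")
val wordCounts = vocabulary.map(w => doc.count(_ == w).toDouble).toArray
val docVector = Vectors.dense(wordCounts)   // [2.0, 0.0, 1.0]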
To demonstrate this with our dataset, we'll use only the category feature, which, when 1-of-k
encoded, is of the correct form for the model. We will create a new dataset as follows:
val dataNB = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  // 1-of-k encode the category field (column 3) as a binary vector
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  LabeledPoint(label, Vectors.dense(categoryFeatures))
}
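This snippet assumes that the categories lookup map and numCategories count created earlier in the chapter are still in scope. If they are not, they can be rebuilt from the raw records along these lines (a minimal sketch, using the category field at column index 3 as before):
// Map each distinct category name to an index for 1-of-k encoding
val categories = records.map(r => r(3)).distinct.collect.zipWithIndex.toMap
val numCategories = categories.size
As a quick sanity check, inspecting a single transformed record (for example, with dataNB.first) should show a vector in which exactly one element is set to 1.0.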
Next, we will train a new naïve Bayes model and evaluate its performance:
// Train naïve Bayes on the 1-of-k encoded category feature only
val nbModelCats = NaiveBayes.train(dataNB)
// Compute accuracy: the fraction of records predicted correctly
val nbTotalCorrectCats = dataNB.map { point =>
  if (nbModelCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbAccuracyCats = nbTotalCorrectCats / numData
// Compute the area under the precision-recall curve
val nbPredictionsVsTrueCats = dataNB.map { point =>
  (nbModelCats.predict(point.features), point.label)
}
val nbMetricsCats = new BinaryClassificationMetrics(nbPredictionsVsTrueCats)
val nbPrCats = nbMetricsCats.areaUnderPR
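To compare these figures against the earlier models, we can print them out. The following is a minimal sketch of how this might look (the exact values will depend on the dataset):
println(f"Naive Bayes (1-of-k categories) accuracy: ${nbAccuracyCats * 100}%2.4f%%")
println(f"Naive Bayes (1-of-k categories) area under PR: ${nbPrCats * 100.0}%2.4f%%")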