Tip
Note that the zip operator assumes that each RDD has the same number of partitions as
well as the same number of elements in each partition. It will fail if this is not the case.
We can make this assumption here because we have effectively created both our tfidf RDD
and newsgroups RDD from a series of map transformations on the same original RDD that
preserved the partitioning structure.
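To make the zip precondition concrete, the following is a minimal sketch that models a partitioned collection with plain Scala vectors rather than with Spark itself. The names ZipSketch and zipPartitioned are hypothetical and exist only for this illustration; they are not part of the Spark API, and the real operator reports these failures in its own way.

```scala
// Toy model only: a "distributed" collection is a Vector of partitions.
// ZipSketch and zipPartitioned are hypothetical names, not part of Spark.
object ZipSketch {
  type Partitioned[A] = Vector[Vector[A]]

  // Pairs elements partition by partition, enforcing the two conditions
  // described in the tip above.
  def zipPartitioned[A, B](
      left: Partitioned[A],
      right: Partitioned[B]): Partitioned[(A, B)] = {
    require(left.length == right.length, "unequal number of partitions")
    left.zip(right).map { case (l, r) =>
      require(l.length == r.length, "unequal number of elements in a partition")
      l.zip(r)
    }
  }
}
```

In this model, zipping Vector(Vector(1, 2), Vector(3)) with Vector(Vector("a", "b"), Vector("c")) pairs the elements as expected. Regrouping the second collection as Vector(Vector("a"), Vector("b", "c")) makes the call fail, even though both sides still hold three elements in total.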
Now that we have an input RDD in the correct form, we can simply pass it to the naïve
Bayes train function:
val model = NaiveBayes.train(train, lambda = 0.1)
Let's evaluate the performance of the model on the test dataset. We will load the raw test
data from the 20news-bydate-test directory, again using wholeTextFiles to read each
message into an RDD element. We will then extract the class labels from the file paths in
the same way as we did for the newsgroups RDD:
val testPath = "/PATH/20news-bydate-test/*"
val testRDD = sc.wholeTextFiles(testPath)
val testLabels = testRDD.map { case (file, text) =>
  // The topic is the name of the directory that contains the message file
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
}
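As a quick check of the path handling, the sketch below applies the same split expression to a single string outside Spark. The file path and the two-entry newsgroupsMap are hypothetical stand-ins chosen for illustration; in the listing above the map comes from the training data.

```scala
object LabelFromPath {
  // Hypothetical stand-in for the newsgroupsMap built from the training data.
  val newsgroupsMap: Map[String, Int] = Map("rec.autos" -> 0, "sci.space" -> 1)

  // Same expression as in the listing: the topic is the second-to-last
  // component of the path, that is, the directory holding the message file.
  def topicOf(file: String): String = file.split("/").takeRight(2).head

  // Looks up the numeric class label for the topic.
  def labelOf(file: String): Int = newsgroupsMap(topicOf(file))
}
```

For the made-up path "file:/PATH/20news-bydate-test/sci.space/61234", topicOf returns "sci.space" and labelOf returns 1. A topic that is missing from the map would make the lookup throw, so this approach relies on the test directories using the same topic names as the training directories.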
Transforming the text in the test dataset follows the same procedure as for the training
data: we will apply our tokenize function followed by the term frequency transformation,
and we will again use the same IDF computed from the training data to transform the
TF vectors into TF-IDF vectors. Finally, we will zip the test class labels with the TF-IDF
vectors and create our test RDD[LabeledPoint]:
val testTf = testRDD.map { case (file, text) =>
  hashingTF.transform(tokenize(text))
}
val testTfIdf = idf.transform(testTf)
val zippedTest = testLabels.zip(testTfIdf)
val test = zippedTest.map { case (topic, vector) =>
  LabeledPoint(topic, vector)
}
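The reason for reusing the training IDF can be illustrated with a small sketch in plain Scala. It represents term-frequency vectors as maps from term to count instead of hashed vectors, and it assumes the smoothed weighting log((m + 1) / (df + 1)), where m is the number of training documents and df is the number of training documents that contain the term. The names IdfReuseSketch, ToyIdfModel, fit, and transform are hypothetical and are not the MLlib API.

```scala
object IdfReuseSketch {
  // Toy term-frequency vector: term -> count (a stand-in for HashingTF output).
  type Tf = Map[String, Double]

  // Document statistics gathered from the training set only.
  final case class ToyIdfModel(numDocs: Double, docFreq: Map[String, Double])

  // Counts, for every term, how many training documents contain it.
  def fit(trainDocs: Seq[Tf]): ToyIdfModel = {
    val docFreq = trainDocs
      .flatMap(_.keys)
      .groupBy(identity)
      .map { case (term, occurrences) => term -> occurrences.size.toDouble }
    ToyIdfModel(trainDocs.size.toDouble, docFreq)
  }

  // Weights any document, training or test, with the training statistics.
  // Terms never seen in training have a document frequency of 0.
  def transform(model: ToyIdfModel, doc: Tf): Tf =
    doc.map { case (term, tf) =>
      val df = model.docFreq.getOrElse(term, 0.0)
      term -> tf * math.log((model.numDocs + 1) / (df + 1))
    }
}
```

Because the weights come only from the training documents, the test vectors land in the same feature space that the model was trained on, and no statistics from the test set influence the features.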