Tip
Note that the zip operator assumes that each RDD has the same number of partitions as
well as the same number of elements in each partition. It will fail if this is not the case.
We can make this assumption here because we have effectively created both our tfidf
RDD and newsgroups RDD from a series of map transformations on the same original
RDD that preserved the partitioning structure.
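The partitioning requirement can be checked on a toy dataset. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name ZipSketch, the app name, and the sample data are hypothetical and not part of the original pipeline. It zips two RDDs derived from the same parent through map, then attempts to zip RDDs with different partition counts.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Hypothetical toy example; names and data are illustrative only.
object ZipSketch {
  // Returns the zipped pairs and whether the mismatched zip failed.
  def run(): (Seq[(String, Int)], Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("zip-sketch")
    val sc = new SparkContext(conf)
    try {
      // Two RDDs mapped from the same parent keep its partitioning,
      // so they have matching partition and element counts.
      val base = sc.parallelize(1 to 6, numSlices = 2)
      val labels = base.map(i => s"item-$i")
      val doubled = base.map(_ * 2)
      val zipped = labels.zip(doubled).collect().toSeq

      // An RDD with a different number of partitions cannot be zipped.
      val other = sc.parallelize(1 to 6, numSlices = 3)
      val mismatchFailed = Try(labels.zip(other).collect()).isFailure

      (zipped, mismatchFailed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (zipped, mismatchFailed) = run()
    println(zipped.mkString(", "))
    println(s"mismatched zip failed: $mismatchFailed")
  }
}
```

Wrapping the action in Try keeps the sketch from crashing when the mismatch is detected; in a real job you would normally let the error surface.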
Now that we have an input RDD in the correct form, we can simply pass it to the naïve
Bayes train function:
val model = NaiveBayes.train(train, lambda = 0.1)
Let's evaluate the performance of the model on the test dataset. We will load the raw test
data from the 20news-bydate-test directory, again using wholeTextFiles to
read each message into an RDD element. We will then extract the class labels from the
file paths in the same way as we did for the newsgroups RDD:
val testPath = "/PATH/20news-bydate-test/*"
val testRDD = sc.wholeTextFiles(testPath)
val testLabels = testRDD.map { case (file, text) =>
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
}
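To see what the path manipulation does, here is a small sketch in plain Scala, with no Spark required. The file path, the message file name 60822, and the contents of the toy newsgroupsMap are hypothetical; only the split, takeRight, and head chain comes from the code above.

```scala
// Hypothetical toy example; the path and the map contents are made up.
object LabelFromPath {
  // The topic is the name of the directory that contains the message file,
  // that is, the second-to-last path component.
  def topicOf(file: String): String =
    file.split("/").takeRight(2).head

  def main(args: Array[String]): Unit = {
    // Stand-in for the newsgroupsMap built from the training data.
    val newsgroupsMap = Map("sci.space" -> 0, "rec.autos" -> 1)

    val file = "file:/PATH/20news-bydate-test/sci.space/60822"
    val topic = topicOf(file)
    println(s"$topic -> ${newsgroupsMap(topic)}")
  }
}
```

Note that newsgroupsMap(topic) throws a NoSuchElementException if the test set contains a topic that was absent from the training data, so this approach relies on both directories sharing the same set of newsgroups.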
Transforming the text in the test dataset follows the same procedure as for the training
data: we will apply our tokenize function followed by the term frequency transformation,
and we will again use the same IDF computed from the training data to transform the
TF vectors into TF-IDF vectors. Finally, we will zip the test class labels with the TF-IDF
vectors and create our test RDD[LabeledPoint]:
val testTf = testRDD.map { case (file, text) =>
  hashingTF.transform(tokenize(text))
}
val testTfIdf = idf.transform(testTf)
val zippedTest = testLabels.zip(testTfIdf)
val test = zippedTest.map { case (topic, vector) =>
  LabeledPoint(topic, vector)
}
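The reason for reusing the training IDF can be illustrated without Spark. The following plain-Scala sketch is an assumption-laden toy, not the MLlib implementation: it works on string terms instead of hashed indices, all names (IdfReuseSketch, ToyIdfModel, fit, transform) are hypothetical, and it assumes the smoothed weighting log((m + 1) / (df + 1)), where m is the number of training documents and df is the number of training documents containing the term. The point is that document frequencies are counted once, on the training documents, and then applied unchanged to test documents.

```scala
// Hypothetical toy example in plain Scala; not the MLlib API.
object IdfReuseSketch {
  // Holds the statistics learned from the training documents only.
  final case class ToyIdfModel(numDocs: Long, docFreq: Map[String, Long]) {
    // Assumed smoothed IDF: log((m + 1) / (df + 1)); unseen terms have df = 0.
    def idf(term: String): Double =
      math.log((numDocs + 1.0) / (docFreq.getOrElse(term, 0L) + 1.0))

    // Weights the raw term counts of any document by the training IDF.
    def transform(doc: Seq[String]): Map[String, Double] =
      doc.groupBy(identity).map { case (term, occurrences) =>
        term -> occurrences.size * idf(term)
      }
  }

  // Counts, for each term, how many training documents contain it.
  def fit(trainDocs: Seq[Seq[String]]): ToyIdfModel = {
    val docFreq = trainDocs
      .flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, occurrences) => term -> occurrences.size.toLong }
    ToyIdfModel(trainDocs.size.toLong, docFreq)
  }

  def main(args: Array[String]): Unit = {
    val model = fit(Seq(Seq("a", "b"), Seq("a", "c")))
    // The test document is transformed with the training statistics;
    // it never contributes to the document frequencies.
    println(model.transform(Seq("b", "b", "z")))
  }
}
```

Fitting a second IDF on the test documents would give the same term different weights in training and testing, so the features seen by the model at evaluation time would no longer match those it was trained on.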