Advanced Text Processing with Spark - Machine Learning with Spark

Tip

Note that the zip operator assumes that both RDDs have the same number of partitions as well as the same number of elements in each partition. It will fail if this is not the case. We can make this assumption here because we have effectively created both our tfidf RDD and our newsgroups RDD from a series of map transformations on the same original RDD, and these transformations preserved the partitioning structure.
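The alignment requirement described in this tip can be sketched on a toy dataset. The following is a minimal illustration, not the book's code: the object name ZipSketch, the local master setting, and the sample strings are all assumptions made for the example. It zips two RDDs derived from the same parent through map, then tries the same zip after a filter has changed the number of elements in one partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Hypothetical demo object; names and data are illustrative only.
object ZipSketch {
  // Returns the pairs from an aligned zip, plus a flag saying whether
  // a deliberately misaligned zip succeeded.
  def run(sc: SparkContext): (Seq[(Int, String)], Boolean) = {
    // Two partitions: ("a", "b") and ("c", "d")
    val base = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // Both RDDs come from map on the same parent, so partition counts
    // and per-partition element counts still match.
    val lengths = base.map(_.length)
    val upper = base.map(_.toUpperCase)
    val aligned = lengths.zip(upper).collect().toSeq

    // filter drops one element from the first partition only, so the
    // per-partition counts no longer line up with `lengths`.
    val filtered = base.filter(_ != "a").map(_.toUpperCase)
    val misalignedOk = Try(lengths.zip(filtered).collect()).isSuccess

    (aligned, misalignedOk)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (aligned, misalignedOk) = run(sc)
      println(s"aligned zip: ${aligned.mkString(", ")}")
      println(s"misaligned zip succeeded: $misalignedOk")
    } finally {
      sc.stop()
    }
  }
}
```

When two RDDs cannot be guaranteed to share this structure, a join on an explicit key (for example, one produced by zipWithIndex) is a slower but safer alternative to zip.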

Now that we have an input RDD in the correct form, we can simply pass it to the naïve Bayes train function:

val model = NaiveBayes.train(train, lambda = 0.1)

Let's evaluate the performance of the model on the test dataset. We will load the raw test data from the 20news-bydate-test directory, again using wholeTextFiles to read each message into an RDD element. We will then extract the class labels from the file paths in the same way as we did for the newsgroups RDD:

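val testPath = "/PATH/20news-bydate-test/*"
val testRDD = sc.wholeTextFiles(testPath)
val testLabels = testRDD.map { case (file, text) =>
  // The newsgroup name is the directory that directly contains the file
  val topic = file.split("/").takeRight(2).head
  // Look up the numeric class label built from the training data
  newsgroupsMap(topic)
}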
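The path-splitting step above can be tried without Spark. The sketch below is only an illustration: the message file name 61546, the three-topic newsgroupsMap, and the object name LabelFromPath are hypothetical stand-ins, since the real map is built from the training data earlier in the chapter.

```scala
// Hypothetical stand-alone illustration of the label extraction step.
object LabelFromPath {
  // Stand-in for the newsgroupsMap built from the training data (assumed shape:
  // newsgroup name -> integer class index).
  val newsgroupsMap: Map[String, Int] =
    Seq("alt.atheism", "comp.graphics", "sci.space").zipWithIndex.toMap

  // Same expression as in the text: keep the last two path segments
  // (directory and file name), then take the first of them.
  def topicOf(file: String): String =
    file.split("/").takeRight(2).head

  def main(args: Array[String]): Unit = {
    // wholeTextFiles keys are full file paths; this one is made up.
    val file = "file:/PATH/20news-bydate-test/sci.space/61546"
    val topic = topicOf(file)
    println(s"topic = $topic, label = ${newsgroupsMap(topic)}")
  }
}
```

Note that newsgroupsMap(topic) throws a NoSuchElementException for a topic that never appeared in the training data; newsgroupsMap.get(topic) returns an Option if such cases need to be handled explicitly.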

Transforming the text in the test dataset follows the same procedure as for the training data: we will apply our tokenize function followed by the term frequency transformation, and we will again use the same IDF computed from the training data to transform the TF vectors into TF-IDF vectors. Finally, we will zip the test class labels with the TF-IDF vectors and create our test RDD[LabeledPoint]:

val testTf = testRDD.map { case (file, text) =>
  hashingTF.transform(tokenize(text))
}
val testTfIdf = idf.transform(testTf)
val zippedTest = testLabels.zip(testTfIdf)
val test = zippedTest.map { case (topic, vector) =>
  LabeledPoint(topic, vector)
}
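With a test RDD[LabeledPoint] in hand, one simple way to score the model is classification accuracy: the fraction of test points whose predicted label equals the true label. The sketch below assumes accuracy is the metric of interest and uses a tiny made-up dataset in place of the newsgroup vectors; the object name AccuracySketch and the accuracy helper are hypothetical, while NaiveBayes.train and model.predict are the MLlib calls used in the text.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical demo object; the data below is invented for illustration.
object AccuracySketch {
  // Fraction of test points whose predicted label matches the true label.
  def accuracy(model: NaiveBayesModel, test: RDD[LabeledPoint]): Double = {
    val correct = test.filter(p => model.predict(p.features) == p.label).count()
    correct.toDouble / test.count()
  }

  def run(sc: SparkContext): Double = {
    // Toy training set: class 0 uses only feature 0, class 1 only feature 1.
    val train = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(0.0, Vectors.dense(2.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 2.0))
    ))
    // Toy test set following the same pattern.
    val test = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(3.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 3.0))
    ))
    val model = NaiveBayes.train(train, lambda = 0.1)
    accuracy(model, test)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("accuracy-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"accuracy = ${run(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

On the real data, the same accuracy helper could be called with the model and test values defined in the text. Accuracy treats all classes equally, so a per-class or weighted metric may be more informative if the newsgroups are unevenly represented.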
