Comparing raw features with processed TF-IDF features on the 20 Newsgroups dataset
In this example, we will simply apply the hashing term frequency transformation to the raw
text tokens obtained by splitting the document text on whitespace. We will train
a model on this data and evaluate its performance on the test set, as we did for the model
trained with TF-IDF features:
val rawTokens = rdd.map { case (file, text) => text.split(" ") }
val rawTF = rawTokens.map(doc => hashingTF.transform(doc))
val rawTrain = newsgroups.zip(rawTF).map { case (topic, vector) =>
  LabeledPoint(newsgroupsMap(topic), vector) }
val rawModel = NaiveBayes.train(rawTrain, lambda = 0.1)
val rawTestTF = testRDD.map { case (file, text) =>
  hashingTF.transform(text.split(" ")) }
val rawZippedTest = testLabels.zip(rawTestTF)
val rawTest = rawZippedTest.map { case (topic, vector) =>
  LabeledPoint(topic, vector) }
val rawPredictionAndLabel = rawTest.map(p =>
  (rawModel.predict(p.features), p.label))
val rawAccuracy = 1.0 * rawPredictionAndLabel.filter(x =>
  x._1 == x._2).count() / rawTest.count()
println(rawAccuracy)
val rawMetrics = new MulticlassMetrics(rawPredictionAndLabel)
println(rawMetrics.weightedFMeasure)
Perhaps surprisingly, the raw model does quite well, although both accuracy and F-measure
are a few percentage points lower than those of the TF-IDF model. This is also partly a
reflection of the fact that the naïve Bayes model is well suited to data in the form of raw
frequency counts:
0.7661975570897503
0.7628947184990661
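To make the hashing term frequency step concrete, here is a minimal, self-contained sketch of the hashing trick that Spark's HashingTF applies under the hood: each token is hashed into a bucket of a fixed-size vector and the bucket's count is incremented. The object name, the toy dimension of 10 (MLlib's default is far larger), and the use of String.hashCode are assumptions for illustration, not Spark's exact implementation.

```scala
object HashingTFSketch {
  // Map a token sequence to a fixed-size term-frequency vector
  // by hashing each token into one of numFeatures buckets.
  def transform(tokens: Seq[String], numFeatures: Int = 10): Array[Double] = {
    val counts = new Array[Double](numFeatures)
    tokens.foreach { token =>
      // Non-negative bucket index; collisions simply share a bucket.
      val idx = math.abs(token.hashCode) % numFeatures
      counts(idx) += 1.0
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val vec = transform("the quick brown fox jumps over the lazy dog".split(" "))
    println(vec.mkString(","))
  }
}
```

Note that the vector's entries always sum to the number of input tokens; hashing only redistributes the counts, which is why the resulting vectors can feed directly into a multinomial naïve Bayes model.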