Comparing raw features with processed TF-IDF features on the 20 Newsgroups dataset
In this example, we will simply apply the hashing term frequency transformation to the raw
text tokens obtained by splitting the document text on whitespace. We will train
a model on this data and evaluate its performance on the test set, as we did for the model
trained with TF-IDF features:
val rawTokens = rdd.map { case (file, text) => text.split(" ") }
val rawTF = rawTokens.map(doc => hashingTF.transform(doc))
val rawTrain = newsgroups.zip(rawTF).map { case (topic, vector) =>
  LabeledPoint(newsgroupsMap(topic), vector) }
val rawModel = NaiveBayes.train(rawTrain, lambda = 0.1)
val rawTestTF = testRDD.map { case (file, text) =>
  hashingTF.transform(text.split(" ")) }
val rawZippedTest = testLabels.zip(rawTestTF)
val rawTest = rawZippedTest.map { case (topic, vector) =>
  LabeledPoint(topic, vector) }
val rawPredictionAndLabel = rawTest.map(p =>
  (rawModel.predict(p.features), p.label))
val rawAccuracy = 1.0 * rawPredictionAndLabel.filter(x =>
  x._1 == x._2).count() / rawTest.count()
println(rawAccuracy)
val rawMetrics = new MulticlassMetrics(rawPredictionAndLabel)
println(rawMetrics.weightedFMeasure)
Perhaps surprisingly, the raw model does quite well, although both accuracy and F-measure
are a few percentage points lower than those of the TF-IDF model. This is also partly a
reflection of the fact that the naïve Bayes model is well suited to data in the form of raw
frequency counts:
0.7661975570897503
0.7628947184990661
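To make the hashing term frequency step concrete, here is a minimal, self-contained sketch of the hashing trick that Spark's HashingTF applies under the hood: each token is hashed into a bucket of a fixed-size vector and the bucket's count is incremented. The object name, the toy dimension of 10 (MLlib's default is far larger), and the use of String.hashCode are assumptions for illustration, not Spark's exact implementation.

```scala
object HashingTFSketch {
  // Map a token sequence to a fixed-size term-frequency vector
  // by hashing each token into one of numFeatures buckets.
  def transform(tokens: Seq[String], numFeatures: Int = 10): Array[Double] = {
    val counts = new Array[Double](numFeatures)
    tokens.foreach { token =>
      // Non-negative bucket index; collisions simply share a bucket.
      val idx = math.abs(token.hashCode) % numFeatures
      counts(idx) += 1.0
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val vec = transform("the quick brown fox jumps over the lazy dog".split(" "))
    println(vec.mkString(","))
  }
}
```

Note that the vector's entries always sum to the number of input tokens; hashing only redistributes the counts, which is why the resulting vectors can feed directly into a multinomial naïve Bayes model.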