Training a text classifier on the 20 Newsgroups dataset using TF-IDF
When using TF-IDF vectors, we expected that the cosine similarity measure would capture
the similarity between documents, based on the overlap of terms between them. In a similar
way, we would expect that a machine learning model, such as a classifier, would be able to
learn weightings for individual terms; this would allow it to distinguish between documents
from different classes. That is, it should be possible to learn a mapping between the
presence (and weighting) of certain terms and a specific topic.
In the 20 Newsgroups example, each newsgroup topic is a class, and we can train a
classifier using our TF-IDF transformed vectors as input.
Since we are dealing with a multiclass classification problem, we will use the naïve Bayes
model in MLlib, which supports multiple classes. As the first step, we will import the
Spark classes that we will be using:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics
Next, we will need to extract the 20 topics and convert them to class mappings. We can do
this in exactly the same way as we might for 1-of-K feature encoding, by assigning a nu-
meric index to each class:
val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
val zipped = newsgroups.zip(tfidf)
val train = zipped.map { case (topic, vector) =>
  LabeledPoint(newsgroupsMap(topic), vector)
}
train.cache
In the preceding code snippet, we took the newsgroups RDD, where each element is the
topic, and used the zip function to combine it with each element in our tfidf RDD of
TF-IDF vectors. We then mapped over each key-value element in our new zipped RDD
and created a LabeledPoint instance, where label is the class index and features
is the TF-IDF vector.
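With the training RDD in place, the NaiveBayes and MulticlassMetrics imports above can be put to use. The following is a hedged sketch of what that training and evaluation step might look like; the lambda value of 0.1 is an illustrative choice for the additive-smoothing parameter, and evaluating on the training set itself (as done here for brevity) will give an optimistic accuracy estimate compared to a held-out test set:

```scala
// Train a multiclass naive Bayes model on the LabeledPoint RDD built
// above; lambda is the additive (Laplace) smoothing parameter.
val model = NaiveBayes.train(train, lambda = 0.1)

// Pair each prediction with the true class index. Note that this
// evaluates on the training data itself; an unbiased estimate would
// require a separate test set.
val predictionAndLabel = train.map { p =>
  (model.predict(p.features), p.label)
}

// Overall accuracy: the fraction of documents assigned the correct
// topic index.
val accuracy = predictionAndLabel.filter { case (pred, label) =>
  pred == label
}.count().toDouble / train.count()

// MulticlassMetrics also exposes per-class and weighted measures.
val metrics = new MulticlassMetrics(predictionAndLabel)
println(s"Accuracy: $accuracy")
println(s"Weighted F1: ${metrics.weightedFMeasure}")
```

Because naive Bayes treats each term weight as (conditionally) independent evidence for a class, it pairs naturally with high-dimensional, sparse TF-IDF vectors and trains in a single pass over the data.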