Training a text classifier on the 20 Newsgroups dataset using TF-IDF
When using TF-IDF vectors, we expected that the cosine similarity measure would capture
the similarity between documents, based on the overlap of terms between them. In a similar
way, we would expect that a machine learning model, such as a classifier, would be able to
learn weightings for individual terms; this would allow it to distinguish between documents
from different classes. That is, it should be possible to learn a mapping between the
presence (and weighting) of certain terms and a specific topic.
In the 20 Newsgroups example, each newsgroup topic is a class, and we can train a
classifier using our TF-IDF transformed vectors as input.
Since we are dealing with a multiclass classification problem, we will use the naïve Bayes
model in MLlib, which supports multiple classes. As the first step, we will import the
Spark classes that we will be using:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics
Next, we will need to extract the 20 topics and convert them to class mappings. We can do
this in exactly the same way as we might for 1-of-K feature encoding, by assigning a nu-
meric index to each class:
val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
val zipped = newsgroups.zip(tfidf)
val train = zipped.map { case (topic, vector) =>
  LabeledPoint(newsgroupsMap(topic), vector)
}
train.cache
In the preceding code snippet, we took the newsgroups RDD, where each element is the
topic, and used the zip function to combine it with each element in our tfidf RDD of
TF-IDF vectors. We then mapped over each key-value element in our new zipped RDD
and created a LabeledPoint instance, where label is the class index and features
is the TF-IDF vector.
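With the training RDD in place, the NaiveBayes and MulticlassMetrics imports above can be put to use. The following is a hedged sketch of what that training and evaluation step might look like; the lambda value of 0.1 is an illustrative choice for the additive-smoothing parameter, and evaluating on the training set itself (as done here for brevity) will give an optimistic accuracy estimate compared to a held-out test set:

```scala
// Train a multiclass naive Bayes model on the LabeledPoint RDD built
// above; lambda is the additive (Laplace) smoothing parameter.
val model = NaiveBayes.train(train, lambda = 0.1)

// Pair each prediction with the true class index. Note that this
// evaluates on the training data itself; an unbiased estimate would
// require a separate test set.
val predictionAndLabel = train.map { p =>
  (model.predict(p.features), p.label)
}

// Overall accuracy: the fraction of documents assigned the correct
// topic index.
val accuracy = predictionAndLabel.filter { case (pred, label) =>
  pred == label
}.count().toDouble / train.count()

// MulticlassMetrics also exposes per-class and weighted measures.
val metrics = new MulticlassMetrics(predictionAndLabel)
println(s"Accuracy: $accuracy")
println(s"Weighted F1: ${metrics.weightedFMeasure}")
```

Because naive Bayes treats each term weight as (conditionally) independent evidence for a class, it pairs naturally with high-dimensional, sparse TF-IDF vectors and trains in a single pass over the data.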