Document similarity with the 20 Newsgroups dataset and TF-IDF features
You might recall from Chapter 4, Building a Recommendation Engine with Spark, that the similarity between two vectors can be computed using a distance metric. The closer two vectors are (that is, the lower the distance metric), the more similar they are. One such metric that we used to compute similarity between movies is cosine similarity.
Just like we did for movies, we can also compute the similarity between two documents.
Using TF-IDF, we have transformed each document into a vector representation. Hence, we
can use the same techniques as we used for movie vectors to compare two documents.
Intuitively, we might expect two documents to be more similar to each other if they share many terms. Conversely, we might expect two documents to be less similar if they each contain many terms that the other does not. Since cosine similarity is computed from the dot product of the two vectors, and each vector is made up of the terms in its document, documents with a high overlap of terms will tend to have a higher cosine similarity.
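To make this concrete, here is a minimal sketch of a cosine similarity function (the name cosineSimilarity is our own choice) written against Breeze sparse vectors, the linear algebra library that Spark uses internally:
import breeze.linalg.SparseVector

// cosine similarity: the dot product of the two vectors divided by
// the product of their Euclidean norms
def cosineSimilarity(v1: SparseVector[Double], v2: SparseVector[Double]): Double =
  (v1 dot v2) / (math.sqrt(v1 dot v1) * math.sqrt(v2 dot v2))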
Now, we can see TF-IDF at work. We might reasonably expect that even very different documents might contain many overlapping terms that are relatively common (for example, our stop words). However, due to a low TF-IDF weighting, these terms will not have a significant impact on the dot product and, therefore, will not have much impact on the similarity computed.
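As a rough numeric illustration (the document counts below are invented for the example), MLlib's IDF weighting is log((m + 1) / (d + 1)), where m is the total number of documents and d is the number of documents containing the term, so a near-ubiquitous stop word receives a weight close to zero:
val m = 10000.0      // hypothetical corpus size
val dStop = 9900.0   // a stop word appearing in almost every document
val dRare = 10.0     // a rare, topic-specific term
math.log((m + 1) / (dStop + 1))   // ~0.01: contributes almost nothing
math.log((m + 1) / (dRare + 1))   // ~6.81: dominates the dot product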
For example, we might expect two randomly chosen messages from the hockey newsgroup to be relatively similar to each other. Let's see if this is the case:
val hockeyText = rdd.filter { case (file, text) =>
  // keep only the messages whose file path contains "hockey"
  file.contains("hockey")
}
// tokenize each document and compute its term frequency vector
val hockeyTF = hockeyText.mapValues(doc => hashingTF.transform(tokenize(doc)))
// apply the IDF weighting to obtain TF-IDF vectors
val hockeyTfIdf = idf.transform(hockeyTF.map(_._2))
In the preceding code, we first filtered our raw input RDD to keep only the messages within the hockey topic. We then applied our tokenization and term frequency transformation functions. Note that the transform method used here is the version that works on a single document (a sequence of terms) rather than the version that works on an RDD of documents.
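With the hockey TF-IDF vectors in hand, we can test this intuition by comparing two of them. Here is a sketch, assuming the cosineSimilarity helper defined earlier and converting MLlib's sparse vectors to Breeze's format (the variable names are ours):
import breeze.linalg.{SparseVector => BSV}
import org.apache.spark.mllib.linalg.{SparseVector => SV}

// take the TF-IDF vectors of the first two hockey messages and
// convert them from MLlib's sparse format to Breeze's
val Array(first, second) = hockeyTfIdf.take(2).map { v =>
  val sv = v.asInstanceOf[SV]
  new BSV[Double](sv.indices, sv.values, sv.size)
}
// two messages from the same newsgroup should score relatively high
println(cosineSimilarity(first, second))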