Document similarity with the 20 Newsgroups dataset and TF-IDF features
You might recall from Chapter 4, Building a Recommendation Engine with Spark, that the similarity between two vectors can be computed using a distance metric. The closer two vectors are (that is, the lower the distance metric), the more similar they are. One such metric that we used to compute similarity between movies is cosine similarity.
Just like we did for movies, we can also compute the similarity between two documents.
Using TF-IDF, we have transformed each document into a vector representation. Hence, we
can use the same techniques as we used for movie vectors to compare two documents.
Intuitively, we might expect two documents to be more similar to each other if they share many terms. Conversely, we might expect two documents to be less similar if they each contain many terms that the other does not. Since cosine similarity is computed from the dot product of the two vectors, and each vector is made up of the terms in its document, documents with a high overlap of terms will tend to have a higher cosine similarity.
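To make this concrete, here is a minimal sketch of a cosine similarity function (the name cosineSimilarity is our own choice) written against Breeze sparse vectors, the linear algebra library that Spark uses internally:
import breeze.linalg.SparseVector

// cosine similarity: the dot product of the two vectors divided by
// the product of their Euclidean norms
def cosineSimilarity(v1: SparseVector[Double], v2: SparseVector[Double]): Double =
  (v1 dot v2) / (math.sqrt(v1 dot v1) * math.sqrt(v2 dot v2))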
Now, we can see TF-IDF at work. We might reasonably expect that even very different documents might contain many overlapping terms that are relatively common (for example, our stop words). However, due to a low TF-IDF weighting, these terms will not have a significant impact on the dot product and, therefore, will not have much impact on the similarity computed.
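As a rough numeric illustration (the document counts below are invented for the example), MLlib's IDF weighting is log((m + 1) / (d + 1)), where m is the total number of documents and d is the number of documents containing the term, so a near-ubiquitous stop word receives a weight close to zero:
val m = 10000.0      // hypothetical corpus size
val dStop = 9900.0   // a stop word appearing in almost every document
val dRare = 10.0     // a rare, topic-specific term
math.log((m + 1) / (dStop + 1))   // ~0.01: contributes almost nothing
math.log((m + 1) / (dRare + 1))   // ~6.81: dominates the dot product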
For example, we might expect two randomly chosen messages from the hockey newsgroup to be relatively similar to each other. Let's see if this is the case:
val hockeyText = rdd.filter { case (file, text) =>
  // keep only the messages whose file path contains "hockey"
  file.contains("hockey")
}
// tokenize each document and compute its term frequency vector
val hockeyTF = hockeyText.mapValues(doc => hashingTF.transform(tokenize(doc)))
// apply the IDF weighting to obtain TF-IDF vectors
val hockeyTfIdf = idf.transform(hockeyTF.map(_._2))
In the preceding code, we first filtered our raw input RDD to keep only the messages within the hockey topic. We then applied our tokenization and term frequency transformation functions. Note that the transform method used here is the version that works on a single document (a sequence of terms) rather than the version that works on an RDD of documents.
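With the hockey TF-IDF vectors in hand, we can test this intuition by comparing two of them. Here is a sketch, assuming the cosineSimilarity helper defined earlier and converting MLlib's sparse vectors to Breeze's format (the variable names are ours):
import breeze.linalg.{SparseVector => BSV}
import org.apache.spark.mllib.linalg.{SparseVector => SV}

// take the TF-IDF vectors of the first two hockey messages and
// convert them from MLlib's sparse format to Breeze's
val Array(first, second) = hockeyTfIdf.take(2).map { v =>
  val sv = v.asInstanceOf[SV]
  new BSV[Double](sv.indices, sv.values, sv.size)
}
// two messages from the same newsgroup should score relatively high
println(cosineSimilarity(first, second))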