val graphicsTF = graphicsText.mapValues(doc =>
  hashingTF.transform(tokenize(doc)))
val graphicsTfIdf = idf.transform(graphicsTF.map(_._2))
val graphics = graphicsTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breezeGraphics = new SparseVector(graphics.indices,
  graphics.values, graphics.size)
val cosineSim2 = breeze1.dot(breezeGraphics) /
  (norm(breeze1) * norm(breezeGraphics))
println(cosineSim2)
The cosine similarity is significantly lower at 0.0047:
0.004664850323792852
Finally, a document from another sports-related topic is likely to be more similar
to our hockey document than one from a computer-related topic. However, we would
probably expect a baseball document to be less similar to it than another hockey
document is.
Let's see whether this is the case by computing the similarity between a random message
from the baseball newsgroup and our hockey document:
val baseballText = rdd.filter { case (file, text) =>
  file.contains("baseball") }
val baseballTF = baseballText.mapValues(doc =>
  hashingTF.transform(tokenize(doc)))
val baseballTfIdf = idf.transform(baseballTF.map(_._2))
val baseball = baseballTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breezeBaseball = new SparseVector(baseball.indices,
  baseball.values, baseball.size)
val cosineSim3 = breeze1.dot(breezeBaseball) /
  (norm(breeze1) * norm(breezeBaseball))
println(cosineSim3)
Indeed, as we expected, the baseball and hockey documents have a cosine similarity
of about 0.05, which is significantly higher than that of the comp.graphics
document but somewhat lower than that of the other hockey document:
0.05047395039466008
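Each comparison above applies the same formula: the dot product of two TF-IDF vectors divided by the product of their Euclidean norms. To make that arithmetic explicit, here is a minimal sketch in plain Scala with no Spark or Breeze dependency. The object name CosineSketch, its method names, and the toy vectors are hypothetical and not part of the original code. The sketch assumes each sparse vector is given as parallel arrays of strictly increasing indices and their values, and it returns 0.0 for an all-zero vector, which is a choice made for this sketch only (the dot / norm expression used above would produce NaN in that case).

```scala
// Illustrative sketch only: cosine similarity of two sparse vectors,
// each stored as (strictly increasing indices, matching values).
object CosineSketch {

  // Dot product: walk both index arrays in step, multiplying values
  // only where the two vectors share an index.
  def dot(ai: Array[Int], av: Array[Double],
          bi: Array[Int], bv: Array[Double]): Double = {
    var i = 0
    var j = 0
    var sum = 0.0
    while (i < ai.length && j < bi.length) {
      if (ai(i) == bi(j)) { sum += av(i) * bv(j); i += 1; j += 1 }
      else if (ai(i) < bi(j)) i += 1
      else j += 1
    }
    sum
  }

  // Euclidean (L2) norm; absent entries are zero and contribute nothing.
  def norm(values: Array[Double]): Double =
    math.sqrt(values.map(v => v * v).sum)

  // Cosine similarity; 0.0 when either vector has zero norm
  // (an assumption of this sketch, see the lead-in).
  def cosine(ai: Array[Int], av: Array[Double],
             bi: Array[Int], bv: Array[Double]): Double = {
    val denom = norm(av) * norm(bv)
    if (denom == 0.0) 0.0 else dot(ai, av, bi, bv) / denom
  }

  def main(args: Array[String]): Unit = {
    // Toy vectors: they overlap only at index 2, so dot = 2.0 * 3.0 = 6.0,
    // and the norms are 3.0 and 5.0.
    val sim = cosine(Array(0, 2, 5), Array(1.0, 2.0, 2.0),
                     Array(2, 3), Array(3.0, 4.0))
    println(sim) // prints 0.4
  }
}
```

Only indices present in both vectors contribute to the dot product, so two documents that share few weighted terms score close to zero. This is consistent with the low value observed for the comp.graphics document.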