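// Term-frequency vectors for the comp.graphics messages
val graphicsTF = graphicsText.mapValues(doc =>
  hashingTF.transform(tokenize(doc)))
// Re-weight the term frequencies with the fitted IDF model
val graphicsTfIdf = idf.transform(graphicsTF.map(_._2))
// Pick one message at random (with replacement, fraction 0.1, seed 42)
val graphics = graphicsTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
// Convert to a Breeze sparse vector for the linear algebra operations
val breezeGraphics = new SparseVector(graphics.indices,
  graphics.values, graphics.size)
// Cosine similarity: dot product divided by the product of the norms
val cosineSim2 = breeze1.dot(breezeGraphics) /
  (norm(breeze1) * norm(breezeGraphics))
println(cosineSim2)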
The cosine similarity is significantly lower at 0.0047:
0.004664850323792852
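To make the computation behind these similarity scores concrete, the following is a minimal, dependency-free sketch of cosine similarity on sparse vectors. It is not the Breeze implementation used in the listings above: the names CosineSketch, Sparse, dot, norm, and cosine are hypothetical, the indices are assumed to be sorted in ascending order, and returning 0.0 for a zero-length vector is a choice made for this sketch only.

```scala
object CosineSketch {
  // Hypothetical minimal sparse vector: ascending indices, matching values, logical size.
  final case class Sparse(indices: Array[Int], values: Array[Double], size: Int)

  // Dot product: walk both sorted index arrays and multiply where the indices match.
  def dot(a: Sparse, b: Sparse): Double = {
    var i = 0
    var j = 0
    var sum = 0.0
    while (i < a.indices.length && j < b.indices.length) {
      if (a.indices(i) == b.indices(j)) {
        sum += a.values(i) * b.values(j)
        i += 1
        j += 1
      } else if (a.indices(i) < b.indices(j)) {
        i += 1
      } else {
        j += 1
      }
    }
    sum
  }

  // Euclidean (L2) norm; entries that are not stored contribute nothing.
  def norm(a: Sparse): Double = math.sqrt(a.values.map(v => v * v).sum)

  // Cosine similarity; the zero-norm guard is an assumption of this sketch.
  def cosine(a: Sparse, b: Sparse): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }
}
```

Two vectors that share no non-zero indices have a dot product of zero and therefore a cosine similarity of zero. This is why TF-IDF vectors for documents on unrelated topics, which share few heavily weighted terms, tend to score close to zero.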
Finally, a document from another sports-related topic is likely to be more similar to our hockey document than one from a computer-related topic. However, we would probably expect a baseball document to be less similar to it than another hockey document.
Let's see whether this is the case by computing the similarity between a random message from the baseball newsgroup and our hockey document:
// Keep only the messages from the baseball newsgroup
val baseballText = rdd.filter { case (file, text) =>
  file.contains("baseball") }
val baseballTF = baseballText.mapValues(doc =>
  hashingTF.transform(tokenize(doc)))
val baseballTfIdf = idf.transform(baseballTF.map(_._2))
// Pick one message at random (with replacement, fraction 0.1, seed 42)
val baseball = baseballTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breezeBaseball = new SparseVector(baseball.indices,
  baseball.values, baseball.size)
// Cosine similarity between the hockey document and the baseball message
val cosineSim3 = breeze1.dot(breezeBaseball) /
  (norm(breeze1) * norm(breezeBaseball))
println(cosineSim3)
Indeed, as we expected, the baseball and hockey documents have a cosine similarity of 0.05, which is significantly higher than that of the comp.graphics document but somewhat lower than that of the other hockey document:
0.05047395039466008