document (in the form of a Seq[String] ) rather than the version that works on an
RDD of documents.
Finally, we applied the IDF transform (note that we use the same IDF model that we had already computed on the whole corpus).
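Reusing the corpus-level IDF matters because IDF weights are a function of document frequencies across the whole corpus, not the subset being transformed. As a minimal plain-Scala sketch (no Spark, with a made-up toy corpus), the following computes document frequencies and applies the same formula MLlib's IDF uses, log((m + 1) / (df(t) + 1)):

```scala
// Toy corpus of tokenized documents (illustrative data, not from the dataset)
val corpus: Seq[Seq[String]] = Seq(
  Seq("hockey", "goal", "ice"),
  Seq("hockey", "stick"),
  Seq("graphics", "render", "ice")
)
val m = corpus.size

// Document frequency: in how many documents does each term appear?
val df: Map[String, Int] =
  corpus.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

// IDF weight per term, using MLlib's formula log((m + 1) / (df(t) + 1))
val idf: Map[String, Double] =
  df.map { case (t, d) => t -> math.log((m + 1.0) / (d + 1.0)) }

println(idf("hockey"))   // appears in 2 of 3 docs -> log(4/3)
println(idf("graphics")) // rarer term, higher weight -> log(4/2)
```

Fitting IDF on only the hockey subset would shift these weights, since the document frequencies (and m) would change.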
Once we have our hockey document vectors, we can select two of these vectors at random and compute the cosine similarity between them (as we did earlier, we will use Breeze for the linear algebra functionality, first converting our MLlib vectors to Breeze SparseVector instances):
import breeze.linalg._

// Sample one document vector from the hockey TF-IDF RDD (seed 42)
val hockey1 = hockeyTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breeze1 = new SparseVector(hockey1.indices, hockey1.values, hockey1.size)
// Sample a second document vector with a different seed
val hockey2 = hockeyTfIdf.sample(true, 0.1, 43).first.asInstanceOf[SV]
val breeze2 = new SparseVector(hockey2.indices, hockey2.values, hockey2.size)
// Cosine similarity: dot product divided by the product of the norms
val cosineSim = breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
println(cosineSim)
We can see that the cosine similarity between the documents is around 0.06:
0.060250114361164626
While this might seem quite low, recall that the effective dimensionality of our features is high, due to the large number of unique terms that is typical of text data. Hence, we can expect any two documents to have a relatively low overlap of terms even when they are about the same topic, and therefore a low absolute similarity score.
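To see why sparse, high-dimensional vectors tend to yield low cosine similarities, consider a minimal plain-Scala sketch (no Breeze; the term weights are made up for illustration): two documents on the same topic that share only one of three weighted terms score just 1/3, and as the vocabulary grows the shared fraction typically shrinks further.

```scala
// Cosine similarity over sparse term-weight maps
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  // Only terms present in both documents contribute to the dot product
  val dot = (a.keySet intersect b.keySet).toSeq.map(t => a(t) * b(t)).sum
  val norm = (v: Map[String, Double]) => math.sqrt(v.values.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

// Two "hockey" documents that share only one of their three terms
val doc1 = Map("hockey" -> 1.0, "goal" -> 1.0, "ice" -> 1.0)
val doc2 = Map("hockey" -> 1.0, "puck" -> 1.0, "stick" -> 1.0)
println(cosine(doc1, doc2)) // 1.0 / 3.0
```

With thousands of dimensions rather than three, the overlapping mass is usually a much smaller fraction of each norm, which is why scores around 0.06 are unsurprising.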
By contrast, we can compare this similarity score to one computed, using the same methodology, between one of our hockey documents and a document chosen at random from the comp.graphics newsgroup:
val graphicsText = rdd.filter { case (file, text) =>
  file.contains("comp.graphics")
}