document (in the form of a Seq[String] ) rather than the version that works on an
RDD of documents.
Finally, we applied the IDF transform (note that we use the same IDF model that we had already computed on the whole corpus).
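Reusing the corpus-level IDF matters because IDF weights are a function of document frequencies across the whole corpus, not the subset being transformed. As a minimal plain-Scala sketch (no Spark, with a made-up toy corpus), the following computes document frequencies and applies the same formula MLlib's IDF uses, log((m + 1) / (df(t) + 1)):

```scala
// Toy corpus of tokenized documents (illustrative data, not from the dataset)
val corpus: Seq[Seq[String]] = Seq(
  Seq("hockey", "goal", "ice"),
  Seq("hockey", "stick"),
  Seq("graphics", "render", "ice")
)
val m = corpus.size

// Document frequency: in how many documents does each term appear?
val df: Map[String, Int] =
  corpus.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

// IDF weight per term, using MLlib's formula log((m + 1) / (df(t) + 1))
val idf: Map[String, Double] =
  df.map { case (t, d) => t -> math.log((m + 1.0) / (d + 1.0)) }

println(idf("hockey"))   // appears in 2 of 3 docs -> log(4/3)
println(idf("graphics")) // rarer term, higher weight -> log(4/2)
```

Fitting IDF on only the hockey subset would shift these weights, since the document frequencies (and m) would change.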
Once we have our hockey document vectors, we can select two of these vectors at random and compute the cosine similarity between them (as we did earlier, we will use Breeze for the linear algebra functionality, first converting our MLlib vectors to Breeze SparseVector instances):
import breeze.linalg._

// Sample one document vector from the hockey TF-IDF RDD (seed 42)
val hockey1 = hockeyTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breeze1 = new SparseVector(hockey1.indices, hockey1.values, hockey1.size)
// Sample a second document vector with a different seed
val hockey2 = hockeyTfIdf.sample(true, 0.1, 43).first.asInstanceOf[SV]
val breeze2 = new SparseVector(hockey2.indices, hockey2.values, hockey2.size)
// Cosine similarity: dot product divided by the product of the norms
val cosineSim = breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
println(cosineSim)
We can see that the cosine similarity between the documents is around 0.06:
0.060250114361164626
While this might seem quite low, recall that the effective dimensionality of our features is high, due to the large number of unique terms that is typical of text data. Hence, we can expect any two documents to have a relatively low overlap of terms even when they are about the same topic, and therefore a low absolute similarity score.
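To see why sparse, high-dimensional vectors tend to yield low cosine similarities, consider a minimal plain-Scala sketch (no Breeze; the term weights are made up for illustration): two documents on the same topic that share only one of three weighted terms score just 1/3, and as the vocabulary grows the shared fraction typically shrinks further.

```scala
// Cosine similarity over sparse term-weight maps
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  // Only terms present in both documents contribute to the dot product
  val dot = (a.keySet intersect b.keySet).toSeq.map(t => a(t) * b(t)).sum
  val norm = (v: Map[String, Double]) => math.sqrt(v.values.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

// Two "hockey" documents that share only one of their three terms
val doc1 = Map("hockey" -> 1.0, "goal" -> 1.0, "ice" -> 1.0)
val doc2 = Map("hockey" -> 1.0, "puck" -> 1.0, "stick" -> 1.0)
println(cosine(doc1, doc2)) // 1.0 / 3.0
```

With thousands of dimensions rather than three, the overlapping mass is usually a much smaller fraction of each norm, which is why scores around 0.06 are unsurprising.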
By contrast, we can compare this similarity score to one computed, using the same methodology, between one of our hockey documents and a document chosen at random from the comp.graphics newsgroup:
val graphicsText = rdd.filter { case (file, text) =>
  file.contains("comp.graphics")
}