Advanced Text Processing with Spark - Machine Learning with Spark

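// Term-frequency vectors for the comp.graphics messages
val graphicsTF = graphicsText.mapValues(doc =>
  hashingTF.transform(tokenize(doc)))
// Weight the term frequencies with the existing idf model
val graphicsTfIdf = idf.transform(graphicsTF.map(_._2))
// Take one message from a seeded sample as an MLlib sparse vector (SV)
val graphics = graphicsTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
// Convert to a Breeze SparseVector so that dot and norm are available
val breezeGraphics = new SparseVector(graphics.indices,
  graphics.values, graphics.size)
// Cosine similarity with the hockey document vector breeze1
val cosineSim2 = breeze1.dot(breezeGraphics) /
  (norm(breeze1) * norm(breezeGraphics))
println(cosineSim2)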

The cosine similarity is significantly lower at 0.0047:

0.004664850323792852
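
To make the quantity in these snippets concrete, here is a minimal plain-Scala sketch of cosine similarity between two sparse vectors, that is, the dot product divided by the product of the two vector lengths. It assumes that the Breeze norm call above uses the Euclidean (L2) norm and that each vector's indices are sorted in ascending order. The names SparseCosine, Sparse, dot, norm, and cosine are illustrative only and are not part of Spark or Breeze:

```scala
object SparseCosine {
  // Hypothetical stand-in for a sparse vector: parallel arrays holding the
  // sorted indices of the non-zero entries and their values.
  final case class Sparse(indices: Array[Int], values: Array[Double])

  // Dot product: walk both sorted index arrays and multiply matching entries.
  def dot(a: Sparse, b: Sparse): Double = {
    var i = 0
    var j = 0
    var sum = 0.0
    while (i < a.indices.length && j < b.indices.length) {
      if (a.indices(i) == b.indices(j)) {
        sum += a.values(i) * b.values(j)
        i += 1
        j += 1
      } else if (a.indices(i) < b.indices(j)) {
        i += 1
      } else {
        j += 1
      }
    }
    sum
  }

  // Euclidean (L2) norm of the non-zero values.
  def norm(a: Sparse): Double = math.sqrt(a.values.map(v => v * v).sum)

  // Cosine similarity; returns 0.0 if either vector has zero length.
  def cosine(a: Sparse, b: Sparse): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }
}
```

Because TF-IDF weights are non-negative, the result for two such document vectors lies between 0 (no shared terms) and 1 (same direction). Two vectors that share most of their non-zero indices score close to 1, while vectors with no indices in common score exactly 0.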

Finally, a document from another sports-related topic is likely to be more similar to our hockey document than one from a computer-related topic. However, we would probably not expect a baseball document to be as similar to it as another hockey document is.

Let's see whether this is the case by computing the similarity between a random message from the baseball newsgroup and our hockey document:

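// Keep only the messages whose file path mentions the baseball newsgroup
val baseballText = rdd.filter { case (file, text) =>
  file.contains("baseball") }
// Term-frequency vectors, then TF-IDF weighting with the existing idf model
val baseballTF = baseballText.mapValues(doc =>
  hashingTF.transform(tokenize(doc)))
val baseballTfIdf = idf.transform(baseballTF.map(_._2))
// Take one message from a seeded sample as an MLlib sparse vector (SV)
val baseball = baseballTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
// Convert to a Breeze SparseVector so that dot and norm are available
val breezeBaseball = new SparseVector(baseball.indices,
  baseball.values, baseball.size)
// Cosine similarity with the hockey document vector breeze1
val cosineSim3 = breeze1.dot(breezeBaseball) /
  (norm(breeze1) * norm(breezeBaseball))
println(cosineSim3)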

Indeed, as we expected, the baseball and hockey documents have a cosine similarity of 0.05, which is significantly higher than that of the comp.graphics document but somewhat lower than that of the other hockey document:

0.05047395039466008
