Advanced Text Processing with Spark - Machine Learning with Spark

...

14/10/25 14:22:02 INFO SparkContext: Job finished: collect at Word2Vec.scala:368, took 56.585983 s

14/10/25 14:22:02 INFO MappedRDD: Removing RDD 200 from persistence list

14/10/25 14:22:02 INFO BlockManager: Removing RDD 200

14/10/25 14:22:02 INFO BlockManager: Removing block rdd_200_0

14/10/25 14:22:02 INFO MemoryStore: Block rdd_200_0 of size 9008840 dropped from memory (free 1755596828)

word2vecModel: org.apache.spark.mllib.feature.Word2VecModel = org.apache.spark.mllib.feature.Word2VecModel@2b94e480

Once the model is trained, we can easily find the top 20 synonyms for a given term (that is, the terms most similar to the input term, as measured by the cosine similarity between their word vectors).
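To make the ranking idea concrete, here is a minimal sketch in plain Scala of ordering terms by cosine similarity. It does not use Spark, and it is not the MLlib implementation: the object `CosineSketch`, the helper `topSimilar`, and the two-dimensional `toyVectors` are all hypothetical names and made-up values chosen for illustration. A trained model would supply much higher-dimensional vectors learned from the corpus.

```scala
// Toy illustration of ranking terms by cosine similarity.
// All names and vectors here are hypothetical; a trained Word2VecModel
// would supply real, learned vectors.
object CosineSketch {

  // Cosine similarity: dot product divided by the product of the norms.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // Treat similarity with a zero vector as 0 to avoid dividing by zero.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Return the `num` terms most similar to `query`, excluding the query itself.
  def topSimilar(
      query: String,
      vectors: Map[String, Array[Double]],
      num: Int): Seq[(String, Double)] = {
    val q = vectors(query)
    vectors.toSeq
      .collect { case (term, v) if term != query => (term, cosine(q, v)) }
      .sortBy { case (_, sim) => -sim } // highest similarity first
      .take(num)
  }

  // Made-up two-dimensional vectors, for illustration only.
  val toyVectors: Map[String, Array[Double]] = Map(
    "hockey" -> Array(1.0, 0.0),
    "sport"  -> Array(1.0, 1.0),
    "banana" -> Array(0.0, 1.0)
  )
}
```

Calling `CosineSketch.topSimilar("hockey", CosineSketch.toyVectors, 2)` ranks `sport` ahead of `banana`, because the `sport` vector points in a direction closer to that of `hockey`. The (term, similarity) pairs it returns have the same shape as the tuples printed in the output further down.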

For example, to find the 20 terms most similar to hockey, use the following line of code:

word2vecModel.findSynonyms("hockey", 20).foreach(println)

As we can see from the following output, most of the terms relate to hockey or other sports topics:

(sport,0.6828256249427795)

(ecac,0.6718048453330994)

(hispanic,0.6519884467124939)

(glens,0.6447514891624451)

(woofers,0.6351765394210815)

(boxscores,0.6009076237678528)

(tournament,0.6006366014480591)

(champs,0.5957855582237244)

(aargh,0.584071934223175)

(playoff,0.5834275484085083)

(ahl,0.5784651637077332)

(ncaa,0.5680188536643982)

(pool,0.5612311959266663)

(olympic,0.5552600026130676)

(champion,0.5549421310424805)

(filinuk,0.5528956651687622)

(yankees,0.5502706170082092)
