Word2Vec on the 20 Newsgroups dataset
Training a Word2Vec model in Spark is relatively simple. We will pass in an RDD where
each element is a sequence of terms. We can use the RDD of tokenized documents we have
already created as input to the model:
import org.apache.spark.mllib.feature.Word2Vec
// create a Word2Vec instance with a fixed random seed for reproducibility
val word2vec = new Word2Vec()
word2vec.setSeed(42)
// train the model on the RDD of tokenized documents
val word2vecModel = word2vec.fit(tokens)
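If you have not built the tokens RDD from the earlier tokenization step, the following is a minimal sketch of producing a compatible RDD[Seq[String]]. It assumes an existing SparkContext sc and a hypothetical local path to the dataset; the chapter's earlier pipeline also removes stop words and rare terms, which is omitted here for brevity:
import org.apache.spark.rdd.RDD

// hypothetical location of the 20 Newsgroups training documents
val path = "/path/to/20news-bydate-train/*"
// read each document as one string, discarding the file name
val rawText: RDD[String] = sc.wholeTextFiles(path).map { case (_, text) => text }
// simple tokenization: split on non-word characters, lower-case, keep alphabetic terms
val tokens: RDD[Seq[String]] = rawText.map { doc =>
  doc.split("""\W+""").map(_.toLowerCase).filter(_.matches("[a-z]+")).toSeq
}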
Tip
Note that we used setSeed to fix the random seed for model training, so that the same results are produced each time the model is trained.
You will see some output similar to the following while the model is being trained:
...
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2133172, alpha = 0.0011868763094487506
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2144172, alpha = 0.0010640806039941193
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2155172, alpha = 9.412848985394907E-4
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2166172, alpha = 8.184891930848592E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2177172, alpha = 6.956934876302307E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2188172, alpha = 5.728977821755993E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2199172, alpha = 4.501020767209707E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2210172, alpha = 3.2730637126634213E-4
14/10/25 14:22:01 INFO Word2Vec: wordCount = 2221172, alpha = 2.0451066581171076E-4
14/10/25 14:22:01 INFO Word2Vec: wordCount = 2232172, alpha = 8.171496035708214E-5
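Once training has finished, the fitted model can be queried for terms that are close together in the learned vector space. As a brief sketch (the query word here is an arbitrary example, not taken from the output above), findSynonyms returns the requested number of nearest terms together with their similarity scores:
// find the 20 terms most similar to "hockey" in the learned vector space
val synonyms = word2vecModel.findSynonyms("hockey", 20)
synonyms.foreach { case (term, similarity) => println(s"$term $similarity") }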