Word2Vec on the 20 Newsgroups dataset
Training a Word2Vec model in Spark is relatively simple. We will pass in an RDD where
each element is a sequence of terms. We can use the RDD of tokenized documents we have
already created as input to the model:
import org.apache.spark.mllib.feature.Word2Vec
// create a Word2Vec instance with a fixed random seed for reproducibility
val word2vec = new Word2Vec()
word2vec.setSeed(42)
// train the model on the RDD of tokenized documents
val word2vecModel = word2vec.fit(tokens)
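If you have not built the tokens RDD from the earlier tokenization step, the following is a minimal sketch of producing a compatible RDD[Seq[String]]. It assumes an existing SparkContext sc and a hypothetical local path to the dataset; the chapter's earlier pipeline also removes stop words and rare terms, which is omitted here for brevity:
import org.apache.spark.rdd.RDD

// hypothetical location of the 20 Newsgroups training documents
val path = "/path/to/20news-bydate-train/*"
// read each document as one string, discarding the file name
val rawText: RDD[String] = sc.wholeTextFiles(path).map { case (_, text) => text }
// simple tokenization: split on non-word characters, lower-case, keep alphabetic terms
val tokens: RDD[Seq[String]] = rawText.map { doc =>
  doc.split("""\W+""").map(_.toLowerCase).filter(_.matches("[a-z]+")).toSeq
}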
Tip
Note that we used setSeed to fix the random seed for model training, so that the same results are produced each time the model is trained.
You will see some output similar to the following while the model is being trained:
...
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2133172, alpha = 0.0011868763094487506
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2144172, alpha = 0.0010640806039941193
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2155172, alpha = 9.412848985394907E-4
14/10/25 14:21:59 INFO Word2Vec: wordCount = 2166172, alpha = 8.184891930848592E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2177172, alpha = 6.956934876302307E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2188172, alpha = 5.728977821755993E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2199172, alpha = 4.501020767209707E-4
14/10/25 14:22:00 INFO Word2Vec: wordCount = 2210172, alpha = 3.2730637126634213E-4
14/10/25 14:22:01 INFO Word2Vec: wordCount = 2221172, alpha = 2.0451066581171076E-4
14/10/25 14:22:01 INFO Word2Vec: wordCount = 2232172, alpha = 8.171496035708214E-5
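Once training has finished, the fitted model can be queried for terms that are close together in the learned vector space. As a brief sketch (the query word here is an arbitrary example, not taken from the output above), findSynonyms returns the requested number of nearest terms together with their similarity scores:
// find the 20 terms most similar to "hockey" in the learned vector space
val synonyms = word2vecModel.findSynonyms("hockey", 20)
synonyms.foreach { case (term, similarity) => println(s"$term $similarity") }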