Word2Vec models
Until now, we have used a bag-of-words vector, optionally with a weighting scheme such as TF-IDF, to represent the text in a document. Another class of models that has recently become popular represents individual words as vectors.
These are generally based in some way on the co-occurrence statistics between the words
in a corpus. Once the vector representation is computed, we can use these vectors in ways
similar to how we might use TF-IDF vectors (such as using them as features for other ma-
chine learning models). One such common use case is computing the similarity between
two words with respect to their meanings, based on their vector representations.
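The usual similarity measure for this purpose is cosine similarity: the cosine of the angle between two word vectors, which is close to 1 for words used in similar contexts. The following sketch illustrates the idea with made-up toy vectors (real Word2Vec vectors are learned from a corpus and typically have 100 or more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented purely for illustration.
vec_king = [0.9, 0.1, 0.4]
vec_queen = [0.85, 0.2, 0.45]
vec_apple = [0.1, 0.9, 0.2]

# Words with similar meanings should score higher than unrelated ones.
print(cosine_similarity(vec_king, vec_queen))
print(cosine_similarity(vec_king, vec_apple))
```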
Word2Vec refers to a specific implementation of one of these models, often referred to as distributed vector representations. The MLlib model uses a skip-gram model, which seeks to learn vector representations that take into account the contexts in which words occur.
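Concretely, the skip-gram model slides a context window over the corpus and trains each word to predict the words surrounding it. A minimal sketch of how the (target, context) training pairs are generated (the window size here is an illustrative parameter, not MLlib's default):

```python
def skip_gram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by the
    skip-gram objective: each word predicts the words around it."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
print(skip_gram_pairs(sentence, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

The model then learns vectors such that words appearing in similar windows end up close together in the vector space.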
Note
While a detailed treatment of Word2Vec is beyond the scope of this book, Spark's documentation contains some further details on the algorithm as well as links to the reference implementation.
One of the main academic papers underlying Word2Vec is Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space, in Proceedings of Workshop at ICLR, 2013. It is available at http://arxiv.org/pdf/1301.3781.pdf.
Another recent model in the area of word vector representations is GloVe at
http://www-