Word2Vec models
Until now, we have used a bag-of-words vector, optionally with some weighting scheme such as TF-IDF, to represent the text in a document. Another class of models that has recently become popular represents individual words as vectors.
These models are generally based in some way on the co-occurrence statistics of the words in a corpus. Once the vector representation is computed, we can use these vectors in ways similar to how we might use TF-IDF vectors (such as using them as features for other machine learning models). One common use case is computing the similarity between two words with respect to their meanings, based on their vector representations.
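For example, cosine similarity, that is, the dot product of two vectors normalized by their lengths, is a common choice for comparing word vectors. The following is a minimal Scala sketch assuming the vectors are MLlib Vector instances; the function name and structure are illustrative rather than part of any library API:

import org.apache.spark.mllib.linalg.Vector

// Cosine similarity: the dot product of two vectors divided by the
// product of their Euclidean norms, giving a value in [-1, 1].
def cosineSimilarity(a: Vector, b: Vector): Double = {
  val x = a.toArray
  val y = b.toArray
  val dot = x.zip(y).map { case (u, v) => u * v }.sum
  val normX = math.sqrt(x.map(u => u * u).sum)
  val normY = math.sqrt(y.map(v => v * v).sum)
  dot / (normX * normY)
}

Words with similar meanings tend to occur in similar contexts, so their learned vectors point in similar directions and score close to 1 under this measure.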
Word2Vec refers to a specific implementation of one of these models, often referred to as distributed vector representations. The MLlib model uses a skip-gram model, which seeks to learn vector representations that take into account the contexts in which words occur.
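To make this concrete, a minimal usage sketch of MLlib's Word2Vec follows. It assumes a hypothetical docs RDD of raw document strings (for example, loaded with sc.textFile); the vector size, seed, and the query word "hockey" are illustrative choices, not requirements of the API:

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// "docs" is a hypothetical RDD[String] of raw documents; here we
// simply split each document on whitespace to obtain token sequences.
val tokens: RDD[Seq[String]] = docs.map(_.split("""\s+""").toSeq)

val word2vec = new Word2Vec()
  .setVectorSize(100)  // dimensionality of the learned word vectors
  .setSeed(42)         // fix the random seed for reproducible results
val model: Word2VecModel = word2vec.fit(tokens)

// Retrieve the learned vector for a single word ...
val vector = model.transform("hockey")

// ... or find the 10 words closest to it in the vector space,
// ranked by cosine similarity.
model.findSynonyms("hockey", 10).foreach { case (synonym, similarity) =>
  println(s"$synonym, $similarity")
}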
Note
While a detailed treatment of Word2Vec is beyond the scope of this book, Spark's documentation at http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec contains some further details on the algorithm as well as links to the reference implementation.
One of the main academic papers underlying Word2Vec is Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space, in Proceedings of Workshop at ICLR, 2013. It is available at http://arxiv.org/pdf/1301.3781.pdf.
Another recent model in the area of word vector representations is GloVe, available at http://www-nlp.stanford.edu/projects/glove/.