Word2Vec models
Until now, we have used a bag-of-words vector, optionally with a weighting scheme such as TF-IDF, to represent the text in a document. Another class of models that has recently become popular represents individual words as vectors.
These are generally based in some way on the co-occurrence statistics between the words
in a corpus. Once the vector representation is computed, we can use these vectors in ways
similar to how we might use TF-IDF vectors (such as using them as features for other ma-
chine learning models). One such common use case is computing the similarity between
two words with respect to their meanings, based on their vector representations.
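The usual similarity measure for this purpose is cosine similarity: the cosine of the angle between two word vectors, which is close to 1 for words used in similar contexts. The following sketch illustrates the idea with made-up toy vectors (real Word2Vec vectors are learned from a corpus and typically have 100 or more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented purely for illustration.
vec_king = [0.9, 0.1, 0.4]
vec_queen = [0.85, 0.2, 0.45]
vec_apple = [0.1, 0.9, 0.2]

# Words with similar meanings should score higher than unrelated ones.
print(cosine_similarity(vec_king, vec_queen))
print(cosine_similarity(vec_king, vec_apple))
```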
Word2Vec refers to a specific implementation of one of these models, often referred to as distributed vector representations. The MLlib model uses a skip-gram model, which seeks to learn vector representations that take into account the contexts in which words occur.
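Concretely, the skip-gram model slides a context window over the corpus and trains each word to predict the words surrounding it. A minimal sketch of how the (target, context) training pairs are generated (the window size here is an illustrative parameter, not MLlib's default):

```python
def skip_gram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by the
    skip-gram objective: each word predicts the words around it."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
print(skip_gram_pairs(sentence, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

The model then learns vectors such that words appearing in similar windows end up close together in the vector space.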
Note
While a detailed treatment of Word2Vec is beyond the scope of this book, Spark's documentation contains some further details on the algorithm as well as links to the reference implementation.
One of the main academic papers underlying Word2Vec is Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space, in Proceedings of Workshop at ICLR, 2013. It is available at http://arxiv.org/pdf/1301.3781.pdf.
Another recent model in the area of word vector representations is GloVe at
http://www-