A note about stemming
A common step in text processing and tokenization is stemming. This is the conversion of
whole words to a base form (called a word stem). For example, plurals might be converted
to singular (dogs becomes dog), and forms such as walking and walker might become
walk. Stemming can become quite complex and is typically handled with specialized NLP
or search engine software (such as NLTK, OpenNLP, and Lucene). We will
ignore stemming for the purpose of our example here.
Note
A full treatment of stemming is beyond the scope of this topic. You can find more details
at http://en.wikipedia.org/wiki/Stemming.
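To make the idea concrete, here is a deliberately naive suffix-stripping sketch that reproduces the examples given above (dogs, walking, walker). The function name naiveStem, its three rules, and the length thresholds are all assumptions made for illustration. It is not a real stemming algorithm and it is not used anywhere in our pipeline.

```scala
// Toy suffix-stripping stemmer, for illustration only.
// The rules and length thresholds below are assumptions, not a real
// stemming algorithm; real stemmers handle far more cases.
def naiveStem(word: String): String = {
  val w = word.toLowerCase
  if (w.endsWith("ing") && w.length > 5) w.dropRight(3)      // walking -> walk
  else if (w.endsWith("er") && w.length > 4) w.dropRight(2)  // walker  -> walk
  else if (w.endsWith("s") && !w.endsWith("ss") && w.length > 3)
    w.dropRight(1)                                           // dogs    -> dog
  else w
}

val stems = Seq("dogs", "walking", "walker", "walk").map(naiveStem)
println(stems.mkString(", "))  // prints "dog, walk, walk, walk"
```

Rules this simple break down quickly (running would become runn, for instance), which is one reason stemming is usually left to a dedicated library.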
Training a TF-IDF model
We will now use MLlib to transform each document, in the form of processed tokens, into
a vector representation. The first step will be to use the HashingTF implementation,
which makes use of feature hashing to map each token in the input text to an index in the
vector of term frequencies. Then, we will compute the global IDF and use it to transform
the term frequency vectors into TF-IDF vectors.
For each token, the index will thus be the hash of the token (mapped in turn onto the
dimension of the feature vector). The value for each token will be the TF-IDF weighting for
that token (that is, the term frequency multiplied by the inverse document frequency).
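The following plain Scala sketch works through the same idea by hand on a tiny, made-up corpus, without using MLlib. The corpus, the names indexOf, termFrequencies, and docFreq, and the choice of 16 buckets are assumptions for illustration. The index is taken as the token's hash code modulo the vector size, and the IDF uses the smoothed form log((m + 1) / (df + 1)), which we believe matches what MLlib's IDF documents; treat both details as assumptions rather than a description of MLlib internals.

```scala
// Hypothetical toy corpus: each document is already a sequence of tokens.
val docs: Seq[Seq[String]] = Seq(
  Seq("spark", "mllib", "spark"),
  Seq("hashing", "trick", "spark"),
  Seq("mllib", "vector")
)

// Tiny on purpose, so that indices are easy to inspect (collisions may occur).
val numFeatures = 16

// Map a token to a non-negative index in [0, numFeatures).
def indexOf(token: String): Int = {
  val raw = token.## % numFeatures
  if (raw < 0) raw + numFeatures else raw
}

// Sparse term frequency vector for one document: index -> count.
// Tokens that hash to the same index simply have their counts added together.
def termFrequencies(doc: Seq[String]): Map[Int, Double] =
  doc.groupBy(indexOf).map { case (i, ts) => i -> ts.size.toDouble }

val tfVectors: Seq[Map[Int, Double]] = docs.map(termFrequencies)

// Document frequency: in how many documents each index is non-zero.
val docFreq: Map[Int, Int] =
  tfVectors.flatMap(_.keys).groupBy(identity).map { case (i, xs) => i -> xs.size }

val numDocs = docs.size

// Smoothed inverse document frequency (assumed form, see the text above).
def idf(i: Int): Double =
  math.log((numDocs + 1.0) / (docFreq.getOrElse(i, 0) + 1.0))

// TF-IDF vector: term frequency multiplied by inverse document frequency.
val tfidfVectors: Seq[Map[Int, Double]] =
  tfVectors.map(_.map { case (i, tf) => i -> tf * idf(i) })

tfidfVectors.foreach(println)
```

Note that nothing in this scheme stores the vocabulary itself: the vector index is derived from the token alone, which is what lets feature hashing run in a single pass over the data.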
First, we will import the classes we need and create our HashingTF instance, passing in
a dim dimension parameter. While the default feature dimension is 2^20 (or around 1 million),
we will choose 2^18 (or around 260,000), since with about 50,000 tokens, we should
not experience a significant number of hash collisions, and a smaller dimension will be
more memory and processing friendly for illustrative purposes:
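import org.apache.spark.mllib.linalg.{SparseVector => SV}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
// 2^18 = 262,144 hash buckets
val dim = math.pow(2, 18).toInt
val hashingTF = new HashingTF(dim)
// tokens holds the processed tokens for each document
val tf = hashingTF.transform(tokens)
// cache the term frequency vectors, since they are used again to compute the IDF
tf.cache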
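The choice of dimension above can be sanity-checked with a quick estimate. The sketch below assumes that tokens are hashed uniformly at random, which real hash functions only approximate, and uses the standard occupancy formula: hashing n distinct items into m buckets is expected to hit m * (1 - (1 - 1/m)^n) distinct buckets. The helper names expectedOccupied and expectedCollisions are ours, and the vocabulary size of 50,000 is the rough figure quoted earlier.

```scala
// Expected number of distinct buckets hit when n distinct items are hashed
// uniformly at random into m buckets (an idealized assumption).
def expectedOccupied(n: Long, m: Long): Double =
  m * (1.0 - math.pow(1.0 - 1.0 / m, n.toDouble))

// Expected number of items that land in a bucket another item already uses.
def expectedCollisions(n: Long, m: Long): Double =
  n - expectedOccupied(n, m)

val vocab = 50000L  // approximate number of distinct tokens

for (bits <- Seq(16, 18, 20)) {
  val buckets = 1L << bits
  val lost = math.round(expectedCollisions(vocab, buckets))
  println(s"2^$bits ($buckets buckets): about $lost of $vocab tokens expected to share an index")
}
```

Running this for a few candidate dimensions shows how the expected number of colliding tokens falls as the dimension grows, which is the trade-off against memory use described above.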