Advanced Text Processing with Spark - Machine Learning with Spark

A note about stemming

A common step in text processing and tokenization is stemming. This is the conversion of
whole words to a base form (called a word stem). For example, plurals might be converted
to singular (dogs becomes dog), and forms such as walking and walker might become walk.
Stemming can become quite complex and is typically handled with specialized NLP or search
engine software (such as NLTK, OpenNLP, and Lucene). We will ignore stemming for the
purpose of our example here.

Note

A full treatment of stemming is beyond the scope of this topic. You can find more details
at http://en.wikipedia.org/wiki/Stemming.
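
To make the idea of a word stem concrete, here is a minimal sketch of a suffix-stripping stemmer that can be pasted into the Scala REPL or spark-shell. It is not how NLTK, OpenNLP, or Lucene implement stemming; the object name NaiveStemmer, the three-suffix rule list, and the minimum stem length of three characters are all assumptions made purely for illustration.

```scala
// A deliberately naive suffix-stripping stemmer, for illustration only.
// Production stemmers apply many more rules and exceptions.
object NaiveStemmer {
  // Hypothetical rule list: suffixes are tried in order, first match wins.
  private val suffixes = Seq("ing", "er", "s")

  def stem(word: String): String = {
    val w = word.toLowerCase
    suffixes
      // Only strip a suffix if at least three characters remain.
      .find(s => w.endsWith(s) && w.length - s.length >= 3)
      .map(s => w.dropRight(s.length))
      .getOrElse(w)
  }
}

// prints List(dog, walk, walk, walk)
println(Seq("dogs", "walking", "walker", "walk").map(NaiveStemmer.stem))
```

Even this toy version shows why stemming becomes complex: the same rules turn water into wat, which is the kind of error that specialized libraries are designed to avoid.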

Training a TF-IDF model

We will now use MLlib to transform each document, in the form of processed tokens, into
a vector representation. The first step will be to use the HashingTF implementation,
which makes use of feature hashing to map each token in the input text to an index in the
vector of term frequencies. Then, we will compute the global IDF and use it to transform
the term frequency vectors into TF-IDF vectors.

For each token, the index will thus be the hash of the token (mapped in turn onto the
dimension of the feature vector). The value for each token will be the TF-IDF weighting for
that token (that is, the term frequency multiplied by the inverse document frequency).
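
The mapping just described can be sketched in plain Scala, without Spark, to show what happens to a single token. This is an illustration under stated assumptions rather than MLlib's implementation: it uses Scala's built-in hashCode (the hash function MLlib uses may differ between versions), the helper names indexOf, termFrequencies, and inverseDocFrequencies are hypothetical, and the smoothed form log((m + 1) / (df + 1)) follows the formula given in the MLlib IDF documentation, where m is the number of documents and df is the number of documents containing the term.

```scala
// Map a token to a vector index: hash it, then fold the hash into [0, dim).
def indexOf(token: String, dim: Int): Int = {
  val raw = token.hashCode % dim
  if (raw < 0) raw + dim else raw
}

// Term frequencies for one document as a sparse map: index -> count.
def termFrequencies(doc: Seq[String], dim: Int): Map[Int, Double] =
  doc.groupBy(indexOf(_, dim)).map { case (i, ts) => i -> ts.size.toDouble }

// Smoothed inverse document frequency per index: log((m + 1) / (df + 1)).
def inverseDocFrequencies(tfs: Seq[Map[Int, Double]]): Map[Int, Double] = {
  val m = tfs.size
  val df = tfs.flatMap(_.keys).groupBy(identity).map { case (i, is) => i -> is.size }
  df.map { case (i, d) => i -> math.log((m + 1.0) / (d + 1.0)) }
}

// Two toy documents (made-up data, not the dataset used in the text).
val toyDim = 1 << 18
val toyDocs = Seq(Seq("spark", "text", "spark"), Seq("text", "mining"))
val toyTf = toyDocs.map(termFrequencies(_, toyDim))
val toyIdf = inverseDocFrequencies(toyTf)

// TF-IDF: multiply each term frequency by the IDF of its index.
val toyTfIdf = toyTf.map(_.map { case (i, f) => i -> f * toyIdf(i) })
toyTfIdf.foreach(println)
```

Note that under this formula a token that appears in every document, such as text in the toy data, receives a weight of zero, since log(1) = 0.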

First, we will import the classes we need and create our HashingTF instance, passing in
a dim dimension parameter. While the default feature dimension is 2^20 (or around 1
million), we will choose 2^18 (or around 260,000), since with about 50,000 tokens, we
should not experience a significant number of hash collisions, and a smaller dimension
will be more memory and processing friendly for illustrative purposes:

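import org.apache.spark.mllib.linalg.{ SparseVector => SV }
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF

// 2^18 = 262,144 hashed feature dimensions
val dim = math.pow(2, 18).toInt
val hashingTF = new HashingTF(dim)

// tokens holds the processed tokens for each document
val tf = hashingTF.transform(tokens)
// cache the term frequency vectors, since they are reused when computing the IDF
tf.cache()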
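
The choice of dimension can be sanity-checked with a little arithmetic. The following sketch computes the two dimensions mentioned above and gives a rough estimate of how many tokens would share an index. It assumes that the hash spreads tokens uniformly at random over the indices, which is an idealization, and the names expectedOccupied and distinctTokens are hypothetical.

```scala
// Feature dimensions discussed in the text.
val defaultDim = 1 << 20 // 2^20 = 1,048,576
val chosenDim = 1 << 18  // 2^18 = 262,144

// Expected number of distinct indices used when n distinct tokens are hashed
// uniformly at random into m indices: m * (1 - (1 - 1/m)^n).
def expectedOccupied(n: Int, m: Int): Double =
  m * (1.0 - math.pow(1.0 - 1.0 / m, n))

// Approximate vocabulary size quoted in the text.
val distinctTokens = 50000

Seq(chosenDim, defaultDim).foreach { m =>
  // Tokens that land on an index already taken by another token.
  val colliding = distinctTokens - expectedOccupied(distinctTokens, m)
  println(f"dim = $m%d: about $colliding%.0f tokens expected to share an index")
}
```

Running this for both dimensions makes the trade-off explicit: a larger dimension reduces the expected number of collisions at the cost of more memory per vector.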
