In a real pipeline, you will likely need to preprocess and stem
words in a document before passing them to TF. For example, you
might convert all words to lowercase, drop punctuation characters,
and drop suffixes like ing. For best results, you can call a single-node
natural language library such as NLTK in a map().
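As a concrete illustration, here is a minimal single-node sketch of such a preprocessing step using only the standard library. The preprocess helper and its crude suffix rule are hypothetical stand-ins; a real pipeline would likely call a proper stemmer such as NLTK's PorterStemmer inside the function instead.

```python
import string

def preprocess(text):
    """Lowercase, strip punctuation, and crudely drop an "ing" suffix.
    A stand-in for calling a real stemmer (e.g., NLTK) in a map()."""
    cleaned = []
    for word in text.lower().split():
        word = word.strip(string.punctuation)
        # Crude suffix stripping; only for words long enough that
        # removing "ing" leaves a plausible stem.
        if word.endswith("ing") and len(word) > 5:
            word = word[:-3]
        if word:
            cleaned.append(word)
    return cleaned

tokens = preprocess("Running, Jumping: the DOG runs!")
# tokens == ["runn", "jump", "the", "dog", "runs"]
```

In Spark, this would run per document, e.g. `sc.wholeTextFiles("data").map(lambda nt: preprocess(nt[1]))`, before handing the word lists to HashingTF.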
Once you have built term frequency vectors, you can use IDF to compute the inverse
document frequencies, and multiply them with the term frequencies to compute the
TF-IDF. You first call fit() on an IDF object to obtain an IDFModel representing the
inverse document frequencies in the corpus, then call transform() on the model to
convert TF vectors into TF-IDF vectors. Example 11-8 shows how you would compute
TF-IDF starting with Example 11-7.
Example 11-8. Using TF-IDF in Python
from pyspark.mllib.feature import HashingTF, IDF
# Read a set of text files as TF vectors
rdd = sc.wholeTextFiles("data").map(lambda name_text: name_text[1].split())
tf = HashingTF()
tfVectors = tf.transform(rdd).cache()
# Compute the IDF, then the TF-IDF vectors
idf = IDF()
idfModel = idf.fit(tfVectors)
tfIdfVectors = idfModel.transform(tfVectors)
Note that we called cache() on the tfVectors RDD because it is used twice (once to
train the IDF model, and once to multiply the TF vectors by the IDF).
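To make the two steps concrete, here is a hypothetical pure-Python sketch of what the fit and transform calls compute on a toy corpus. The tf_idf helper is illustrative, not Spark's API; MLlib's IDF uses a smoothed formula, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Sketch of the TF-IDF computation: per-document term counts (TF),
    corpus-wide smoothed inverse document frequencies (IDF), and their
    product. docs is a list of token lists."""
    m = len(docs)
    tf = [Counter(doc) for doc in docs]
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((m + 1) / (df[t] + 1)) for t in df}
    return [{t: count * idf[t] for t, count in counts.items()}
            for counts in tf]

weights = tf_idf([["spark", "is", "fast"], ["spark", "mllib"]])
# "spark" appears in every document, so its IDF (and TF-IDF weight)
# is lower than that of the rarer term "mllib"
```

Spark's HashingTF additionally hashes each term to an index in a fixed-size vector rather than keeping a dictionary, which keeps memory bounded for large vocabularies.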
Scaling
Most machine learning algorithms consider the magnitude of each element in the
feature vector, and thus work best when the features are scaled so they weigh equally
(e.g., all features have a mean of 0 and standard deviation of 1). Once you have built
feature vectors, you can use the StandardScaler class in MLlib to do this scaling,
both for the mean and the standard deviation. You create a StandardScaler , call
fit() on a dataset to obtain a StandardScalerModel (i.e., compute the mean and
variance of each column), and then call transform() on the model to scale a dataset.
Example 11-9 demonstrates.
Example 11-9. Scaling vectors in Python
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors
vectors = [Vectors.dense([-2.0, 5.0, 1.0]), Vectors.dense([2.0, 0.0, 1.0])]
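For intuition, here is a hypothetical pure-Python sketch of what StandardScaler computes for the two vectors above: subtract each column's mean and divide by its sample standard deviation (the n - 1 variant). The standard_scale helper is illustrative, not Spark's API; a column with zero deviation is left at 0.0.

```python
import math

def standard_scale(rows):
    """Scale each column to mean 0 and (sample) standard deviation 1.
    rows is a list of equal-length numeric lists."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    # Columns with zero deviation carry no information; emit 0.0.
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]

scaled = standard_scale([[-2.0, 5.0, 1.0], [2.0, 0.0, 1.0]])
# scaled is approximately [[-0.7071, 0.7071, 0.0], [0.7071, -0.7071, 0.0]]
```

Each column of the result has mean 0, and the constant third column maps to 0.0, matching the shape of output StandardScaler would produce for this input.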