Machine Learning with MLlib - Learning Spark

In a real pipeline, you will likely need to preprocess and stem words in a document before passing them to TF. For example, you might convert all words to lowercase, drop punctuation characters, and drop suffixes like ing. For best results you can call a single-node natural language library like NLTK in a map().
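The preprocessing steps described above can be sketched in plain Python. This is only an illustration: the `preprocess` function and the `SUFFIXES` list are hypothetical names introduced here, and the suffix stripping is deliberately naive compared with a real stemmer such as the ones NLTK provides.

```python
import string

# Hypothetical, deliberately naive suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing",)

def preprocess(text):
    """Lowercase, drop punctuation, and strip simple suffixes from each word."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    stemmed = []
    for word in words:
        for suffix in SUFFIXES:
            # Only strip when a reasonably long stem remains
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        stemmed.append(word)
    return stemmed

print(preprocess("Running, jumping and Spark!"))
# → ['runn', 'jump', 'and', 'spark']
```

The result "runn" shows why a proper stemmer is preferable. In a Spark job, a function like this would typically be applied to each document inside a map() before the words are handed to HashingTF.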

Once you have built term frequency vectors, you can use IDF to compute the inverse document frequencies, and multiply them with the term frequencies to compute the TF-IDF. You first call fit() on an IDF object to obtain an IDFModel representing the inverse document frequencies in the corpus, then call transform() on the model to transform TF vectors into TF-IDF vectors. Example 11-8 shows how you would compute TF-IDF starting with Example 11-7.

Example 11-8. Using TF-IDF in Python

from pyspark.mllib.feature import HashingTF, IDF

# Read a set of text files as TF vectors
# (each element is a (name, text) pair; indexing replaces the tuple-unpacking
# lambda, which is valid only in Python 2)
rdd = sc.wholeTextFiles("data").map(lambda name_text: name_text[1].split())
tf = HashingTF()
tfVectors = tf.transform(rdd).cache()

# Compute the IDF, then the TF-IDF vectors
idf = IDF()
idfModel = idf.fit(tfVectors)
tfIdfVectors = idfModel.transform(tfVectors)

Note that we called cache() on the tfVectors RDD because it is used twice (once to

train the IDF model, and once to multiply the TF vectors by the IDF).
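To make the arithmetic behind fit() and transform() concrete, here is a minimal single-machine sketch that does not use Spark. It assumes the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, which is how the MLlib documentation describes its IDF. The `tf_idf` function is a hypothetical helper, and it keys weights by term rather than by hashed index as HashingTF does.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    num_docs = len(docs)
    # Document frequency: the number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, assumed here to match MLlib: log((m + 1) / (df + 1))
    idf = {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights.append({t: count * idf[t] for t, count in tf.items()})
    return weights

docs = [["spark", "is", "fast"], ["spark", "is", "fun", "fun"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in every document, such as "spark" here, get a weight of 0, while a term repeated within a single document, such as "fun", is weighted more heavily.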

Scaling

Most machine learning algorithms consider the magnitude of each element in the feature vector, and thus work best when the features are scaled so they weigh equally (e.g., all features have a mean of 0 and standard deviation of 1). Once you have built feature vectors, you can use the StandardScaler class in MLlib to do this scaling, both for the mean and the standard deviation. You create a StandardScaler, call fit() on a dataset to obtain a StandardScalerModel (i.e., compute the mean and variance of each column), and then call transform() on the model to scale a dataset. Example 11-9 demonstrates.

Example 11-9. Scaling vectors in Python

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

vectors = [Vectors.dense([-2.0, 5.0, 1.0]), Vectors.dense([2.0, 0.0, 1.0])]
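The scaling described in this section can also be sketched without Spark, using the same two vectors. This is an illustration under stated assumptions, not MLlib's implementation: `fit_scaler` and `transform` are hypothetical helpers, the standard deviation uses the sample (n - 1) denominator, and columns with zero variance are mapped to 0.0. It needs at least two rows.

```python
import math

def fit_scaler(rows):
    """Compute per-column mean and sample standard deviation (n - 1 denominator)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return means, stds

def transform(rows, means, stds):
    """Center and scale each column; constant columns (std 0) map to 0.0."""
    return [
        [(x - m) / s if s != 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

rows = [[-2.0, 5.0, 1.0], [2.0, 0.0, 1.0]]
means, stds = fit_scaler(rows)
print(transform(rows, means, stds))
```

The first two columns come out as roughly -0.7071 and 0.7071 in the first row, with the signs reversed in the second. The third column is constant, so it becomes 0.0 in both rows.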
