Feature Extraction
The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text (or from other tokens), and ways to normalize and scale features.
TF-IDF
Term Frequency-Inverse Document Frequency, or TF-IDF, is a simple way to generate feature vectors from text documents (e.g., web pages). It computes two statistics for each term in each document: the term frequency (TF), which is the number of times the term occurs in that document, and the inverse document frequency (IDF), which measures how (in)frequently a term occurs across the whole document corpus. The product of these values, TF × IDF, shows how relevant a term is to a specific document (i.e., if it is common in that document but rare in the whole corpus).
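As an illustration of how these two statistics combine, here is a toy TF-IDF computation in plain Python. The corpus data and the exact smoothing used in the logarithm are assumptions for the sake of the example, not MLlib's precise formulas:

```python
import math

# Toy corpus: four "documents", each a list of terms (illustrative data).
docs = [
    ["hello", "world"],
    ["hello", "spark"],
    ["hello", "mllib"],
    ["spark", "spark", "fast"],
]

def tf(term, doc):
    # Term frequency: number of times the term occurs in the document.
    return doc.count(term)

def idf(term, corpus):
    # Inverse document frequency with a smoothed logarithm (the exact
    # smoothing here is an assumption for illustration).
    m = len(corpus)                              # total number of documents
    d = sum(1 for doc in corpus if term in doc)  # documents containing term
    return math.log((m + 1) / (d + 1))

# "spark" is frequent in the last document but rare in the corpus, so
# its TF-IDF there is high; "hello" appears in most documents, so its
# IDF (and hence its TF-IDF) is low.
print(tf("spark", docs[3]) * idf("spark", docs))
print(tf("hello", docs[0]) * idf("hello", docs))
```

This is the intuition behind the product: a high score requires both a high count within the document and rarity across the corpus.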
MLlib has two algorithms that compute TF-IDF: HashingTF and IDF, both in the mllib.feature package. HashingTF computes a term frequency vector of a given size from a document. To map terms to vector indices, it uses a technique known as the hashing trick. A language like English has hundreds of thousands of words, so maintaining a distinct mapping from each word to an index in the vector would be expensive. Instead, HashingTF takes the hash code of each word modulo a desired vector size, S, and thus maps each word to a number between 0 and S−1. This always yields an S-dimensional vector, and in practice it is quite robust even if multiple words map to the same hash code. The MLlib developers recommend setting S between 2^18 and 2^20.
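The hashing trick itself can be sketched in a couple of lines. Python's built-in hash() stands in here for MLlib's internal hash function, so the resulting indices are illustrative rather than the ones MLlib would produce:

```python
S = 2 ** 18  # a vector size in the recommended range [2**18, 2**20]

def term_index(word, size=S):
    # Hash the word and take it modulo the vector size, giving an index
    # in [0, size). Note: Python randomizes str hashing per process, so
    # indices are stable within a run but not across runs.
    return hash(word) % size

# The same word always maps to the same slot within one process...
assert term_index("hello") == term_index("hello")
# ...and every word lands inside the S-dimensional vector.
assert 0 <= term_index("world") < S
```

Because the index is computed directly from the word, no dictionary of the vocabulary ever needs to be built or stored.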
HashingTF can run either on one document at a time or on a whole RDD. It requires
each “document” to be represented as an iterable sequence of objects—for instance, a
list in Python or a Collection in Java. Example 11-7 uses HashingTF in Python.
Example 11-7. Using HashingTF in Python
>>> from pyspark.mllib.feature import HashingTF
>>> sentence = "hello hello world"
>>> words = sentence.split()  # Split sentence into a list of terms
>>> tf = HashingTF(10000)     # Create vectors of size S = 10,000
>>> tf.transform(words)
SparseVector(10000, {3065: 1.0, 6861: 2.0})
>>> rdd = sc.wholeTextFiles("data").map(lambda nameText: nameText[1].split())
>>> tfVectors = tf.transform(rdd)  # Transforms an entire RDD
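To make the SparseVector output above concrete, here is a plain-Python sketch (no Spark required) of what HashingTF.transform computes: hash each term modulo the vector size and count occurrences. Again, Python's hash() is only a stand-in for MLlib's hash function, so the actual indices will differ from MLlib's:

```python
from collections import Counter

def hashing_tf(words, size=10000):
    # Mimic HashingTF.transform: each term hashes to an index in
    # [0, size); counts accumulate for repeated terms (and for any
    # hash collisions between distinct terms).
    return dict(Counter(hash(w) % size for w in words))

vec = hashing_tf("hello hello world".split())
# The counts sum to the total number of terms (3 here), matching the
# 2.0 and 1.0 entries in the SparseVector shown above.
print(vec, sum(vec.values()))
```

The sparse {index: count} dictionary corresponds directly to the SparseVector's index-to-value mapping; all unmentioned indices are implicitly zero.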