Figure 11-1. Typical steps in a machine learning pipeline
Finally, most learning algorithms have multiple parameters that can affect results, so real-world pipelines will train multiple versions of a model and evaluate each one. To do this, it is common to separate the input data into "training" and "test" sets, and train only on the former, so that the test set can be used to see whether the model overfit the training data. MLlib provides several algorithms for model evaluation.
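The train/test separation described above can be sketched in plain Python (a simplified illustration; in Spark itself you would typically use `RDD.randomSplit` to divide the data). The function name and split fraction here are illustrative choices, not part of MLlib:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and hold out a fraction of it as a test set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

examples = list(range(100))
train, test = train_test_split(examples)
print(len(train), len(test))  # 80 20
```

The key property is that no example appears in both sets, so performance measured on the test set reflects how the model handles unseen data rather than how well it memorized the training data.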
Example: Spam Classification
As a quick example of MLlib, we show a very simple program for building a spam classifier (Examples 11-1 through 11-3). This program uses two MLlib algorithms: HashingTF, which builds term frequency feature vectors from text data, and LogisticRegressionWithSGD, which implements the logistic regression procedure using stochastic gradient descent (SGD). We assume that we start with two files, spam.txt and normal.txt, each of which contains examples of spam and non-spam emails, one per line. We then turn the text in each file into a feature vector with TF, and train a logistic regression model to separate the two types of messages. The code and data files are available in the book's Git repository.
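The hashing trick that HashingTF uses can be sketched in plain Python before we look at the full program. This is a simplified illustration of the idea, not MLlib's actual implementation, and the function name and default size are our own:

```python
def hashing_tf(words, num_features=10000):
    """Map a list of words to a term-frequency vector of fixed size.

    Each word is hashed to an index in [0, num_features), and the
    count at that index is incremented. No vocabulary is stored.
    """
    vector = [0.0] * num_features
    for word in words:
        index = hash(word) % num_features
        vector[index] += 1.0
    return vector

vec = hashing_tf("the quick brown fox the".split(" "))
print(sum(vec))  # 5.0 -- one count per word occurrence
```

Because the vector size is fixed up front, no pass over the data is needed to build a vocabulary; the trade-off is that distinct words can occasionally collide into the same index.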
Example 11-1. Spam classifier in Python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features.
tf = HashingTF(numFeatures = 10000)

# Each email is split into words, and each word is mapped to one feature.
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Create LabeledPoint datasets for positive (spam) and negative (normal) examples.