Figure 11-1. Typical steps in a machine learning pipeline
Finally, most learning algorithms have multiple parameters that can affect results, so
real-world pipelines will train multiple versions of a model and evaluate each one. To
do this, it is common to separate the input data into “training” and “test” sets, and
train only on the former, so that the test set can be used to see whether the model
overfit the training data. MLlib provides several algorithms for model evaluation.
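To make the split concrete, here is a small sketch in plain Python of holding out a test set. In Spark itself one would typically call `RDD.randomSplit` on the input data; `train_test_split` below is an illustrative helper, not an MLlib API:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a list and hold out a fraction of it as the test set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

emails = ["email %d" % i for i in range(100)]
train, test = train_test_split(emails)
print(len(train), len(test))  # 80 20
```

Training on `train` and scoring on `test` then gives an estimate of how the model behaves on data it has not seen.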
Example: Spam Classification
As a quick example of MLlib, we show a very simple program for building a spam
classifier (Examples 11-1 through 11-3). This program uses two MLlib algorithms:
HashingTF, which builds term frequency feature vectors from text data, and
LogisticRegressionWithSGD, which implements the logistic regression procedure
using stochastic gradient descent (SGD). We assume that we start with two files,
spam.txt and normal.txt, containing examples of spam and non-spam emails
respectively, one per line. We then turn the text in each file into a feature
vector with TF and train a logistic regression model to separate the two types
of messages. The code and data files are available in the book's Git repository.
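The hashing trick behind HashingTF can be sketched in a few lines of plain Python. This is only an illustration of the idea, not MLlib's implementation, which uses its own hash function and returns a sparse MLlib vector; `hashing_tf` below is a hypothetical helper:

```python
def hashing_tf(words, num_features=10000):
    """Map a list of words to fixed-size term-frequency features.

    Each word is hashed into one of num_features buckets; the bucket
    counts form the feature vector, stored sparsely as a dict.
    """
    vector = {}
    for word in words:
        index = hash(word) % num_features  # bucket index for this word
        vector[index] = vector.get(index, 0) + 1
    return vector

features = hashing_tf("send money now to get money".split(" "))
# "money" appears twice, so its bucket accumulates a count of 2, and
# the bucket counts sum to the six input words.
```

Because the vector length is fixed in advance, no vocabulary needs to be built first, at the cost of occasional hash collisions between unrelated words.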
Example 11-1. Spam classifier in Python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features.
tf = HashingTF(numFeatures=10000)
# Each email is split into words, and each word is mapped to one feature.
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))
# Create LabeledPoint datasets for positive (spam) and negative (normal) examples.