Figure 11-1. Typical steps in a machine learning pipeline
Finally, most learning algorithms have multiple parameters that can affect results, so real-world pipelines will train multiple versions of a model and evaluate each one. To do this, it is common to separate the input data into "training" and "test" sets, and train only on the former, so that the test set can be used to see whether the model overfit the training data. MLlib provides several algorithms for model evaluation.
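The train/test separation described above can be sketched in plain Python (a simplified illustration; in Spark itself you would typically use `RDD.randomSplit` to divide the data). The function name and split fraction here are illustrative choices, not part of MLlib:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and hold out a fraction of it as a test set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

examples = list(range(100))
train, test = train_test_split(examples)
print(len(train), len(test))  # 80 20
```

The key property is that no example appears in both sets, so performance measured on the test set reflects how the model handles unseen data rather than how well it memorized the training data.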
Example: Spam Classification
As a quick example of MLlib, we show a very simple program for building a spam classifier (Examples 11-1 through 11-3). This program uses two MLlib algorithms: HashingTF, which builds term frequency feature vectors from text data, and LogisticRegressionWithSGD, which implements the logistic regression procedure using stochastic gradient descent (SGD). We assume that we start with two files, spam.txt and normal.txt, each of which contains examples of spam and non-spam emails, one per line. We then turn the text in each file into a feature vector with TF, and train a logistic regression model to separate the two types of messages. The code and data files are available in the book's Git repository.
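The hashing trick that HashingTF uses can be sketched in plain Python before we look at the full program. This is a simplified illustration of the idea, not MLlib's actual implementation, and the function name and default size are our own:

```python
def hashing_tf(words, num_features=10000):
    """Map a list of words to a term-frequency vector of fixed size.

    Each word is hashed to an index in [0, num_features), and the
    count at that index is incremented. No vocabulary is stored.
    """
    vector = [0.0] * num_features
    for word in words:
        index = hash(word) % num_features
        vector[index] += 1.0
    return vector

vec = hashing_tf("the quick brown fox the".split(" "))
print(sum(vec))  # 5.0 -- one count per word occurrence
```

Because the vector size is fixed up front, no pass over the data is needed to build a vocabulary; the trade-off is that distinct words can occasionally collide into the same index.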
Example 11-1. Spam classifier in Python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# Create a HashingTF instance to map email text to vectors of 10,000 features.
tf = HashingTF(numFeatures = 10000)

# Each email is split into words, and each word is mapped to one feature.
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

# Create LabeledPoint datasets for positive (spam) and negative (normal) examples.