positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # Cache since Logistic Regression is an iterative algorithm.

# Run Logistic Regression using the SGD algorithm.
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive example (spam) and a negative one (normal). We first apply
# the same HashingTF feature transformation to get vectors, then apply the model.
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print("Prediction for positive test example: %g" % model.predict(posTest))
print("Prediction for negative test example: %g" % model.predict(negTest))

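The HashingTF step in the Python example maps each word to one of a fixed number of feature slots and counts occurrences. The sketch below illustrates that idea in plain Python, without Spark. It is a rough illustration only: the function name hashing_tf, the use of zlib.crc32 as the hash, and the dense list output are all assumptions made here, and MLlib's HashingTF uses its own hash function and returns sparse vectors.

```python
import zlib


def hashing_tf(words, num_features=10000):
    """Map a list of words to a fixed-length term-frequency vector.

    Hypothetical helper for illustration; not part of MLlib.
    """
    vector = [0.0] * num_features
    for word in words:
        # crc32 is a stand-in hash chosen because it is deterministic;
        # MLlib's HashingTF does not necessarily use this function.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = "cheap stuff cheap money".split(" ")
vec = hashing_tf(words, num_features=50)
print(len(vec), sum(vec))  # prints: 50 4.0
```

Because the vector length is fixed in advance, two different words can land in the same slot. A larger number of features makes such collisions less likely, at the cost of longer vectors.
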
Example 11-2. Spam classifier in Scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val spam = sc.textFile("spam.txt")
val normal = sc.textFile("normal.txt")

// Create a HashingTF instance to map email text to vectors of 10,000 features.
val tf = new HashingTF(numFeatures = 10000)
// Each email is split into words, and each word is mapped to one feature.
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))

// Create LabeledPoint datasets for positive (spam) and negative (normal) examples.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // Cache since Logistic Regression is an iterative algorithm.

// Run Logistic Regression using the SGD algorithm.
val model = new LogisticRegressionWithSGD().run(trainingData)

// Test on a positive example (spam) and a negative one (normal).
val posTest = tf.transform(
  "O M G GET cheap stuff by sending money to ...".split(" "))
val negTest = tf.transform(
  "Hi Dad, I started studying Spark the other ...".split(" "))
println("Prediction for positive test example: " + model.predict(posTest))
println("Prediction for negative test example: " + model.predict(negTest))

Example 11-3. Spam classifier in Java
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;