Machine Learning with MLlib - Learning Spark


positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # Cache since Logistic Regression is an iterative algorithm.

# Run Logistic Regression using the SGD algorithm.
model = LogisticRegressionWithSGD.train(trainingData)

# Test on a positive example (spam) and a negative one (normal). We first apply
# the same HashingTF feature transformation to get vectors, then apply the model.
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

print("Prediction for positive test example: %g" % model.predict(posTest))
print("Prediction for negative test example: %g" % model.predict(negTest))
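
The HashingTF step in this example turns a list of words into a fixed-length term-frequency vector by hashing each word to an index. The following is a minimal plain-Python sketch of that idea, with no Spark dependency. The function name `hashing_tf` and the use of `zlib.crc32` are assumptions made for illustration; Spark uses its own hash function, so the indices produced here will not match those from `tf.transform`.

```python
import zlib
from collections import Counter


def hashing_tf(words, num_features=10000):
    """Map a list of words to a sparse {index: count} term-frequency vector."""
    counts = Counter()
    for word in words:
        # crc32 is an illustrative stand-in, not the hash Spark uses.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# Repeated words land on the same index, so their counts add up.
vec = hashing_tf("cheap cheap stuff".split(" "))
```

Because the vector length is fixed up front, no vocabulary has to be built, at the cost that two different words can occasionally collide on the same index.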

Example 11-2. Spam classifier in Scala

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val spam = sc.textFile("spam.txt")
val normal = sc.textFile("normal.txt")

// Create a HashingTF instance to map email text to vectors of 10,000 features.
val tf = new HashingTF(numFeatures = 10000)
// Each email is split into words, and each word is mapped to one feature.
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))

// Create LabeledPoint datasets for positive (spam) and negative (normal) examples.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // Cache since Logistic Regression is an iterative algorithm.
// Run Logistic Regression using the SGD algorithm.
val model = new LogisticRegressionWithSGD().run(trainingData)

// Test on a positive example (spam) and a negative one (normal).
val posTest = tf.transform(
  "O M G GET cheap stuff by sending money to ...".split(" "))
val negTest = tf.transform(
  "Hi Dad, I started studying Spark the other ...".split(" "))
println("Prediction for positive test example: " + model.predict(posTest))
println("Prediction for negative test example: " + model.predict(negTest))
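
The comment about caching and the final `predict` calls can be made concrete with a small plain-Python sketch. It is a simplified, full-batch gradient descent for logistic regression with no intercept term, not MLlib's implementation; the names `train_logistic` and `predict`, the step size, the iteration count, and the toy data are all assumptions chosen for illustration. The point it shows is that every iteration passes over the whole training set again, and that prediction thresholds a sigmoid score at 0.5 to produce a label of 1 or 0.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, num_features, iterations=100, step=1.0):
    """examples: list of (label, {index: value}) pairs; returns a weight list."""
    weights = [0.0] * num_features
    for _ in range(iterations):
        # Each iteration re-reads the full training set, which is why
        # caching the data pays off in the Spark version.
        gradient = [0.0] * num_features
        for label, features in examples:
            margin = sum(weights[i] * v for i, v in features.items())
            error = sigmoid(margin) - label
            for i, v in features.items():
                gradient[i] += error * v
        for i in range(num_features):
            weights[i] -= step * gradient[i] / len(examples)
    return weights


def predict(weights, features, threshold=0.5):
    """Return 1 if the sigmoid score exceeds the threshold, else 0."""
    margin = sum(weights[i] * v for i, v in features.items())
    return 1 if sigmoid(margin) > threshold else 0


# Toy data: feature 0 appears only in positives, feature 1 only in negatives.
data = [(1, {0: 1.0}), (1, {0: 2.0}), (0, {1: 1.0}), (0, {1: 2.0})]
weights = train_logistic(data, num_features=2)
```

With this toy data, `predict(weights, {0: 1.0})` yields 1 and `predict(weights, {1: 1.0})` yields 0, mirroring the positive and negative test examples in the listing.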

Example 11-3. Spam classifier in Java

import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
