import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

JavaRDD<String> spam = sc.textFile("spam.txt");
JavaRDD<String> normal = sc.textFile("normal.txt");

// Create a HashingTF instance to map email text to vectors of 10,000 features.
final HashingTF tf = new HashingTF(10000);

// Create LabeledPoint datasets for positive (spam) and negative (normal) examples.
JavaRDD<LabeledPoint> posExamples = spam.map(new Function<String, LabeledPoint>() {
  public LabeledPoint call(String email) {
    return new LabeledPoint(1, tf.transform(Arrays.asList(email.split(" "))));
  }
});
JavaRDD<LabeledPoint> negExamples = normal.map(new Function<String, LabeledPoint>() {
  public LabeledPoint call(String email) {
    return new LabeledPoint(0, tf.transform(Arrays.asList(email.split(" "))));
  }
});
JavaRDD<LabeledPoint> trainData = posExamples.union(negExamples);
trainData.cache(); // Cache since Logistic Regression is an iterative algorithm.

// Run Logistic Regression using the SGD algorithm.
LogisticRegressionModel model = new LogisticRegressionWithSGD().run(trainData.rdd());

// Test on a positive example (spam) and a negative one (normal).
Vector posTest = tf.transform(
    Arrays.asList("O M G GET cheap stuff by sending money to ...".split(" ")));
Vector negTest = tf.transform(
    Arrays.asList("Hi Dad, I started studying Spark the other ...".split(" ")));
System.out.println("Prediction for positive example: " + model.predict(posTest));
System.out.println("Prediction for negative example: " + model.predict(negTest));
As you can see, the code is fairly similar in all the languages. It operates directly on
RDDs—in this case, of strings (the original text) and LabeledPoints (an MLlib data
type for a vector of features together with a label).
Data Types
MLlib contains a few specific data types, located in the org.apache.spark.mllib
package (Java/Scala) or pyspark.mllib (Python). The main ones are:
Vector
A mathematical vector. MLlib supports both dense vectors, where every entry is
stored, and sparse vectors, where only the nonzero entries are stored to save
space. We will discuss the different types of vectors shortly. Vectors can be
constructed with the mllib.linalg.Vectors class.
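For instance, the same three-element vector can be built in either representation with the Vectors factory methods (a minimal sketch; the example values are our own):

```java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Dense vector: every entry is stored explicitly.
Vector dense = Vectors.dense(1.0, 0.0, 3.0);

// Sparse vector: pass the size, the indices of the nonzero entries,
// and the values at those indices. Entry 1 is implicitly 0.0.
Vector sparse = Vectors.sparse(3, new int[]{0, 2}, new double[]{1.0, 3.0});
```

The sparse form pays off when most entries are zero, as with the 10,000-feature hashed term-frequency vectors above, since only the nonzero positions and values are stored.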