import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import java.util.Arrays;

JavaRDD<String> spam = sc.textFile("spam.txt");
JavaRDD<String> normal = sc.textFile("normal.txt");

// Create a HashingTF instance to map email text to vectors of 10,000 features.
final HashingTF tf = new HashingTF(10000);

// Create LabeledPoint datasets for positive (spam) and negative (normal) examples.
JavaRDD<LabeledPoint> posExamples = spam.map(new Function<String, LabeledPoint>() {
  public LabeledPoint call(String email) {
    return new LabeledPoint(1, tf.transform(Arrays.asList(email.split(" "))));
  }
});
JavaRDD<LabeledPoint> negExamples = normal.map(new Function<String, LabeledPoint>() {
  public LabeledPoint call(String email) {
    return new LabeledPoint(0, tf.transform(Arrays.asList(email.split(" "))));
  }
});
JavaRDD<LabeledPoint> trainData = posExamples.union(negExamples);
trainData.cache(); // Cache since Logistic Regression is an iterative algorithm.

// Run Logistic Regression using the SGD algorithm.
LogisticRegressionModel model = new LogisticRegressionWithSGD().run(trainData.rdd());

// Test on a positive example (spam) and a negative one (normal).
Vector posTest = tf.transform(
    Arrays.asList("O M G GET cheap stuff by sending money to ...".split(" ")));
Vector negTest = tf.transform(
    Arrays.asList("Hi Dad, I started studying Spark the other ...".split(" ")));
System.out.println("Prediction for positive example: " + model.predict(posTest));
System.out.println("Prediction for negative example: " + model.predict(negTest));
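The key trick in HashingTF is that each term is mapped to an index by hashing it modulo the number of features, so no dictionary of terms needs to be built or stored. A minimal Python sketch of the idea (using Python's built-in `hash`, not MLlib's actual hash function, and a dense list for clarity):

```python
def hashing_tf(terms, num_features=10000):
    """Map a list of terms to a term-frequency vector via feature hashing."""
    vec = [0.0] * num_features  # dense here for clarity; MLlib can return sparse
    for term in terms:
        # Illustrative only: MLlib uses its own hash function internally.
        idx = hash(term) % num_features
        vec[idx] += 1.0
    return vec

v = hashing_tf("GET cheap stuff GET".split(" "), num_features=16)
print(sum(v))  # total term count: 4.0
```

Repeated terms hash to the same index, so their counts accumulate; unrelated terms may occasionally collide, which is the price paid for never materializing a vocabulary.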
As you can see, the code is fairly similar in all the languages. It operates directly on RDDs — in this case, of strings (the original text) and LabeledPoints (an MLlib data type for a vector of features together with a label).
Data Types
MLlib contains a few specific data types, located in the
org.apache.spark.mllib
package (Java/Scala) or
pyspark.mllib
(Python). The main ones are:
Vector
A mathematical vector. MLlib supports both dense vectors, where every entry is stored, and sparse vectors, where only the nonzero entries are stored to save space. We will discuss the different types of vectors shortly. Vectors can be constructed with the mllib.linalg.Vectors class.
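The space saving from sparse vectors comes from keeping only (index, value) pairs for nonzero entries, plus the overall size. A small Python sketch of that representation (illustrative only; MLlib's Vectors classes provide this natively):

```python
def to_sparse(dense):
    """Represent a dense vector as (size, indices, values), keeping only nonzeros."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

size, idx, vals = to_sparse([0.0, 3.0, 0.0, 0.0, 1.5])
print(size, idx, vals)  # prints: 5 [1, 4] [3.0, 1.5]
```

For the 10,000-feature vectors produced by HashingTF above, a typical short email touches only a few dozen indices, so the sparse form is dramatically smaller than storing all 10,000 entries.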