LabeledPoint
A labeled data point for supervised learning algorithms such as classification and regression. Includes a feature vector and a label (which is a floating-point value). Located in the mllib.regression package.
Rating
A rating of a product by a user, used in the mllib.recommendation package for product recommendation.
Various Model classes
Each Model is the result of a training algorithm, and typically has a predict() method for applying the model to a new data point or to an RDD of new data points.
Most algorithms work directly on RDDs of Vectors, LabeledPoints, or Ratings. You can construct these objects however you want, but typically you will build an RDD through transformations on external data (for example, by loading a text file or running a Spark SQL command) and then apply a map() to turn your data objects into MLlib types.
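As a rough illustration of that last step, the sketch below parses lines of text into (label, features) pairs in plain Python. The input format ("label,f1 f2 f3"), the sample data, and the function name parse_line are assumptions made for this example, not anything MLlib prescribes. In a Spark program the same function could be passed to map() on an RDD returned by sc.textFile(), with each result wrapped in LabeledPoint(label, features).

```python
def parse_line(line):
    """Parse a hypothetical 'label,f1 f2 f3' line into (label, features)."""
    label_text, features_text = line.split(",", 1)
    label = float(label_text)
    features = [float(x) for x in features_text.split()]
    return label, features

# Stand-in for lines loaded from a text file (made-up data).
lines = ["1.0,0.5 0.0 2.0", "0.0,1.5 3.0 0.0"]

# Plain Python map(); on an RDD this would be rdd.map(parse_line).
parsed = list(map(parse_line, lines))
print(parsed)
```

Keeping the parsing logic in a small standalone function makes it easy to check locally before running it over a full dataset.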
Working with Vectors
There are a few points to note for the Vector class in MLlib, which will be the most commonly used one.
First, vectors come in two flavors: dense and sparse. Dense vectors store all their entries in an array of floating-point numbers. For example, a vector of size 100 will contain 100 double values. In contrast, sparse vectors store only the nonzero values and their indices. Sparse vectors are usually preferable (both in terms of memory use and speed) if at most 10% of elements are nonzero. Many featurization techniques yield very sparse vectors, so using this representation is often a key optimization.
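To make the trade-off concrete, here is a minimal plain-Python sketch of the two layouts. It is not MLlib's implementation; the helper names to_sparse and sparse_dot are hypothetical and exist only for this illustration, and the data is made up.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def sparse_dot(sparse, weights):
    """Dot product that touches only the stored nonzero entries."""
    _, indices, values = sparse
    return sum(v * weights[i] for i, v in zip(indices, values))

# A vector of size 100 with only three nonzero entries.
dense = [0.0] * 100
dense[3], dense[40], dense[97] = 1.0, 2.5, -4.0

sparse = to_sparse(dense)
size, indices, values = sparse
weights = [1.0] * 100

print(len(dense))                   # number of values the dense form stores
print(len(indices) + len(values))   # indices plus values the sparse form stores
print(sparse_dot(sparse, weights))
```

In this example the dense form stores 100 numbers while the sparse form stores 6, and the dot product performs 3 multiplications instead of 100.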
Second, the ways to construct vectors vary a bit by language. In Python, you can simply pass a NumPy array anywhere in MLlib to represent a dense vector, or use the mllib.linalg.Vectors class to build vectors of other types (see Example 11-4).[2] In Java and Scala, use the mllib.linalg.Vectors class (see Examples 11-5 and 11-6).
Example 11-4. Creating vectors in Python
from numpy import array
from pyspark.mllib.linalg import Vectors

# Create the dense vector <1.0, 2.0, 3.0>
[2] If you use SciPy, Spark also recognizes scipy.sparse matrices of size N×1 as length-N vectors.