LabeledPoint
A labeled data point for supervised learning algorithms such as classification and
regression. Includes a feature vector and a label (which is a floating-point value).
Located in the mllib.regression package.
Rating
A rating of a product by a user, used in the mllib.recommendation package for
product recommendation.
Various Model classes
Each Model is the result of a training algorithm, and typically has a predict()
method for applying the model to a new data point or to an RDD of new data
points.
Most algorithms work directly on RDDs of Vectors, LabeledPoints, or Ratings. You
can construct these objects however you want, but typically you will build an RDD
through transformations on external data (for example, by loading a text file or
running a Spark SQL command) and then apply a map() to turn your data objects into
MLlib types.
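As a minimal sketch of that last step, the function below parses one line of text into a label and a list of feature values, which is the shape of data a LabeledPoint holds. The line format ("label,f1 f2 f3") and the names parse_line, lines, and parsed are assumptions made for illustration; they do not come from MLlib. The sketch runs on a plain Python list, and the comments show roughly how the same function would be passed to map() on an RDD.

```python
def parse_line(line):
    """Split a hypothetical 'label,f1 f2 f3' line into (label, [f1, f2, f3])."""
    label_text, feature_text = line.split(",", 1)
    features = [float(x) for x in feature_text.split()]
    return float(label_text), features


# Stand-in for the lines of a text file (illustrative data only).
lines = ["1.0,0.5 2.0 0.0", "0.0,1.5 0.0 3.0"]
parsed = [parse_line(line) for line in lines]

# With PySpark available, the same function could be applied with map():
#   from pyspark.mllib.regression import LabeledPoint
#   points = (sc.textFile(path)
#               .map(parse_line)
#               .map(lambda lf: LabeledPoint(lf[0], lf[1])))
# Here sc (a SparkContext) and path are assumed to exist already.
```

Keeping the parsing logic in an ordinary function makes it easy to check on a few sample lines before running it over a whole RDD.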
Working with Vectors
A few points are worth noting about the Vector class, which is the most commonly
used MLlib data type.
First, vectors come in two flavors: dense and sparse. Dense vectors store all their
entries in an array of floating-point numbers. For example, a vector of size 100 will
contain 100 double values. In contrast, sparse vectors store only the nonzero values
and their indices. Sparse vectors are usually preferable (both in terms of memory use
and speed) if at most 10% of elements are nonzero. Many featurization techniques
yield very sparse vectors, so using this representation is often a key optimization.
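The difference between the two layouts can be sketched in plain Python. This is an illustration of the storage idea only, not MLlib's implementation: the helpers to_sparse and density are hypothetical names, and the 10% threshold is the rule of thumb quoted above.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the nonzero entries."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def density(dense):
    """Fraction of entries that are nonzero."""
    if not dense:
        return 0.0
    return sum(1 for x in dense if x != 0.0) / len(dense)


# A vector of size 100 with three nonzero entries.
dense = [0.0] * 100
dense[3], dense[40], dense[97] = 1.5, -2.0, 7.0

size, indices, values = to_sparse(dense)
# Dense storage holds 100 doubles; the sparse form holds 3 indices and 3 values.
prefer_sparse = density(dense) <= 0.10  # rule of thumb from the text
```

The sparse form needs the size in addition to the index and value lists, because the trailing zeros are no longer stored.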
Second, the ways to construct vectors vary a bit by language. In Python, you can
simply pass a NumPy array anywhere in MLlib to represent a dense vector, or use the
mllib.linalg.Vectors class to build vectors of other types (see Example 11-4). [2] In
Java and Scala, use the mllib.linalg.Vectors class (see Examples 11-5 and 11-6).
Example 11-4. Creating vectors in Python
from numpy import array
from pyspark.mllib.linalg import Vectors
# Create the dense vector <1.0, 2.0, 3.0>
[2] If you use SciPy, Spark also recognizes scipy.sparse matrices of size N×1 as length-N vectors.
 