class, which stores an RDD of Vectors, one per row.[8] You can then call PCA as shown in Example 11-13.
Example 11-13. PCA in Scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val points: RDD[Vector] = // ...
val mat: RowMatrix = new RowMatrix(points)
val pc: Matrix = mat.computePrincipalComponents(2)

// Project points to low-dimensional space
val projected = mat.multiply(pc).rows

// Train a k-means model on the projected 2-dimensional data
// (k = 10 clusters; the cap of 20 iterations is an illustrative choice)
val model = KMeans.train(projected, 10, 20)
In this example, the projected RDD contains a two-dimensional version of the original points RDD, and can be used for plotting or as input to other MLlib algorithms, such as clustering via K-means.
Note that computePrincipalComponents() returns a mllib.linalg.Matrix object, which is a utility class representing dense matrices, similar to Vector. You can get at the underlying data with toArray.
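As a minimal sketch of what toArray gives you: MLlib's dense matrices store their values column by column, so entry (i, j) sits at index j * numRows + i of the flat array. The matrix below is built by hand with Matrices.dense as a stand-in for a real PCA result, and the entry helper is a hypothetical name introduced only for this illustration.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

object MatrixLayoutSketch {
  // Hypothetical helper: look up entry (i, j) in the flat,
  // column-major array returned by Matrix.toArray.
  def entry(flat: Array[Double], numRows: Int, i: Int, j: Int): Double =
    flat(j * numRows + i)

  def main(args: Array[String]): Unit = {
    // A 3 x 2 matrix standing in for computePrincipalComponents(2) on
    // 3-dimensional data. The values are made up and listed column by column.
    val pc: Matrix = Matrices.dense(3, 2, Array(
      0.1, 0.2, 0.3,   // column 0
      0.4, 0.5, 0.6))  // column 1

    val flat: Array[Double] = pc.toArray

    // pc(i, j) and the manual column-major lookup should agree everywhere.
    for (i <- 0 until pc.numRows; j <- 0 until pc.numCols) {
      assert(pc(i, j) == entry(flat, pc.numRows, i, j))
    }
    println(flat.mkString(", "))
  }
}
```

For everyday access, pc(i, j) is simpler; the flat array is mainly useful when handing the data to another library.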
Singular value decomposition
MLlib also provides the lower-level singular value decomposition (SVD) primitive. The SVD factorizes an m × n matrix A into three matrices A ≈ UΣVᵀ, where:
• U is an orthonormal matrix, whose columns are called left singular vectors.
• Σ is a diagonal matrix with nonnegative diagonals in descending order, whose diagonals are called singular values.
• V is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, we usually don't need the complete factorization, only the top singular values and their associated singular vectors. This can save storage, denoise the data, and recover the low-rank structure of the matrix. If we keep the top k singular values, then the dimensions of the resulting matrices will be U: m × k, Σ: k × k, and V: n × k.
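The dimensions above can be checked with a small sketch that calls computeSVD() on a RowMatrix. The toy 4 × 3 data, the local[2] master, and the shapes helper are assumptions made for illustration. Note that MLlib returns Σ as a vector s holding only the k diagonal entries, rather than as a k × k matrix.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object TruncatedSvdSketch {
  // Hypothetical helper: shapes of U, s, and V for a top-k SVD of `mat`.
  def shapes(mat: RowMatrix, k: Int): ((Long, Long), Int, (Int, Int)) = {
    // computeU = true asks MLlib to also build the distributed U factor.
    val svd: SingularValueDecomposition[RowMatrix, Matrix] =
      mat.computeSVD(k, computeU = true)
    ((svd.U.numRows(), svd.U.numCols()), svd.s.size, (svd.V.numRows, svd.V.numCols))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TruncatedSvdSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy matrix with m = 4 rows and n = 3 columns; values are made up.
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 0.0, 2.0),
        Vectors.dense(0.0, 3.0, 1.0),
        Vectors.dense(4.0, 1.0, 0.0),
        Vectors.dense(2.0, 2.0, 5.0)))
      val mat = new RowMatrix(rows)

      // Keep the top k = 2 singular values.
      val (uShape, sLength, vShape) = shapes(mat, 2)
      println(s"U: ${uShape._1} x ${uShape._2}")   // m x k
      println(s"s: $sLength singular values")      // diagonal of the k x k Σ
      println(s"V: ${vShape._1} x ${vShape._2}")   // n x k
    } finally {
      sc.stop()
    }
  }
}
```

U is returned as a distributed RowMatrix because it has one row per input row, while s and V are small enough to live on the driver as a local Vector and Matrix.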
[8] In Java, start with a JavaRDD of Vectors, and then call .rdd() on it to convert it to a Scala RDD.