class, which stores an RDD of Vectors, one per row.⁸ You can then call PCA as
shown in Example 11-13.
Example 11-13. PCA in Scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val points: RDD[Vector] = // ...
val mat: RowMatrix = new RowMatrix(points)
val pc: Matrix = mat.computePrincipalComponents(2)

// Project points to low-dimensional space
val projected = mat.multiply(pc).rows

// Train a k-means model on the projected 2-dimensional data
// (10 clusters, at most 100 iterations)
val model = KMeans.train(projected, 10, 100)
In this example, the projected RDD contains a two-dimensional version of the original
points RDD; it can be used for plotting or as input to other MLlib algorithms,
such as clustering with k-means.
Note that computePrincipalComponents() returns an mllib.linalg.Matrix object,
a utility class representing dense matrices, similar to Vector. You can get at
the underlying data with toArray().
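For instance, here is a minimal sketch of inspecting the components matrix from Example 11-13 (the variable pc comes from that example; the column-major layout is how MLlib's dense matrices store their entries):
val pc: Matrix = mat.computePrincipalComponents(2)

// pc has one row per original feature and one column per principal component
println(s"Principal components: ${pc.numRows} x ${pc.numCols}")

// toArray returns the entries as a flat Array[Double] in column-major order,
// so the first numRows entries form the first principal component
val entries: Array[Double] = pc.toArray
val firstComponent = entries.slice(0, pc.numRows)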
Singular value decomposition
MLlib also provides the lower-level singular value decomposition (SVD) primitive.
The SVD factorizes an m × n matrix A into three matrices A ≈ UΣV^T, where:
• U is an orthonormal matrix, whose columns are called left singular vectors.
• Σ is a diagonal matrix with nonnegative diagonal entries in descending order; these entries are called singular values.
• V is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, we usually don't need the complete factorization but only the top
singular values and their associated singular vectors. This can save storage, reduce noise,
and recover the low-rank structure of the matrix. If we keep the top k singular values,
then the dimensions of the resulting matrices will be U: m × k, Σ: k × k, and
V: n × k.
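As a sketch of how this looks in MLlib (assuming mat is the RowMatrix from Example 11-13; computeSVD is the RowMatrix method that performs this factorization):
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Compute the top 20 singular values and the corresponding singular vectors
val svd: SingularValueDecomposition[RowMatrix, Matrix] =
  mat.computeSVD(20, computeU = true)

val U: RowMatrix = svd.U  // U is a distributed RowMatrix
val s: Vector = svd.s     // singular values, returned as a local dense vector
val V: Matrix = svd.V     // V is a local dense matrix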
8 In Java, start with a JavaRDD of Vectors, and then call .rdd() on it to convert it to a Scala RDD.