Dimensionality Reduction with Spark - Machine Learning with Spark

Database Reference

In-Depth Information

Singular Value Decomposition

SVD seeks to decompose a matrix X of dimension m x n into three component matrices:

• U of dimension m x m

• S , a diagonal matrix of size m x n ; the entries of S are referred to as the singular

values

• V T of dimension n x n

X = U * S * V T

Looking at the preceding formula, it appears that we have not reduced the dimensionality

of the problem at all, as by multiplying U , S , and V , we reconstruct the original matrix. In

practice, the truncated SVD is usually computed. That is, only the top k singular values,

which represent the most variation in the data, are kept, while the rest are discarded. The

formula to reconstruct X based on the component matrices is then approximate:

X ~ U k * S k * V k T

An illustration of the truncated SVD is shown here:

The truncated Singular Value Decomposition

Keeping the top k singular values is similar to keeping the top k principal components in

PCA. In fact, SVD and PCA are directly related, as we will see a little later in this chapter.

Note

A detailed mathematical treatment of both PCA and SVD is beyond the scope of this topic.

An overview of dimensionality reduction can be found in the Spark documentation at ht-

Search WWH ::

Custom Search

Home