Dimensionality Reduction with Spark - Machine Learning with Spark

Evaluating k for SVD on the LFW dataset

We will examine the singular values obtained from computing the SVD on our image data.

We can verify that the singular values are the same for each run (up to floating-point precision) and that they are returned in decreasing order, as follows:

val sValues = (1 to 5).map { i => matrix.computeSVD(i, computeU = false).s }

sValues.foreach(println)

This should show us output similar to the following:

[54091.00997110354]

[54091.00997110358,33757.702867982436]

[54091.00997110357,33757.70286798241,24541.193694775946]

[54091.00997110358,33757.70286798242,24541.19369477593,23309.58418888302]

[54091.00997110358,33757.70286798242,24541.19369477593,23309.584188882982,21803.09841158358]
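The same two properties can also be checked numerically rather than by eye. The following is a minimal sketch in Python, assuming NumPy is available; the arrays s3 and s5 are simply the k = 3 and k = 5 vectors copied from the output above, and the tolerance of 1e-9 is an arbitrary choice for illustration:

```python
import numpy as np

# Singular values copied from the printed output above (k = 3 and k = 5 runs).
s3 = np.array([54091.00997110357, 33757.70286798241, 24541.193694775946])
s5 = np.array([54091.00997110358, 33757.70286798242, 24541.19369477593,
               23309.584188882982, 21803.09841158358])

# Each vector should be sorted in strictly decreasing order.
is_decreasing = bool(np.all(np.diff(s5) < 0))

# The leading values should agree across runs up to floating-point noise.
runs_agree = bool(np.allclose(s3, s5[:3], rtol=1e-9))

print(is_decreasing, runs_agree)
```

The small differences in the trailing digits between runs are what we would expect from an iterative numerical solver, which is why the comparison uses a tolerance rather than exact equality.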

As with evaluating values of k for clustering, in the case of SVD (and PCA), it is often useful to plot the singular values for a larger range of k and look for the point on the graph where the amount of additional variance accounted for by each additional singular value starts to flatten out considerably.
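To put a number on "flattening out", one common approach is to look at the cumulative share of the squared singular values. The following is a rough sketch in Python, assuming NumPy; cumulative_energy and smallest_k are hypothetical helper names introduced here for illustration, and the input is only the five values printed above, so the fractions are relative to those five values rather than to the full spectrum. Squared singular values correspond to variance only when the data has been mean-centered; otherwise, they measure the total "energy" captured.

```python
import numpy as np

def cumulative_energy(s):
    """Cumulative fraction of the total squared singular values."""
    energy = np.asarray(s, dtype=float) ** 2
    return np.cumsum(energy) / energy.sum()

def smallest_k(s, threshold):
    """Smallest k whose cumulative fraction reaches `threshold` (hypothetical helper)."""
    return int(np.searchsorted(cumulative_energy(s), threshold) + 1)

# The five singular values printed earlier; the fractions below are relative
# to these five values only, not to the full spectrum of the image matrix.
s = [54091.00997110358, 33757.70286798242, 24541.19369477593,
     23309.584188882982, 21803.09841158358]

frac = cumulative_energy(s)
print(frac)
print(smallest_k(s, 0.9))
```

Applied to the full vector of 300 singular values, the same helpers would give a threshold-based choice of k to compare against the visual elbow in the plot.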

We will do this by first computing the top 300 singular values:

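import java.io.File
import breeze.linalg.{DenseMatrix, csvwrite}

// Wrap the 300 singular values in a 1 x 300 Breeze matrix so that csvwrite can save them
val svd300 = matrix.computeSVD(300, computeU = false)
val sMatrix = new DenseMatrix(1, 300, svd300.s.toArray)
csvwrite(new File("/tmp/s.csv"), sMatrix)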

Here, we write the vector S of singular values to a temporary CSV file (as we did for our matrix of Eigenfaces previously). We then read it back in our IPython console and plot the singular values for each k:

s = np.loadtxt("/tmp/s.csv", delimiter=",")

print(s.shape)

plot(s)

You should see an image displayed similar to the one shown here:
