at RDDFunctions.scala:111, took 0.495859 s
scaler: org.apache.spark.mllib.feature.StandardScalerModel = org.apache.spark.mllib.feature.StandardScalerModel@6bb1a1a1
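The output above is what the Spark shell prints once the scaler has been fit. For context, a minimal sketch of the fitting step looks like the following (the vectors RDD of image vectors and the exact parameters are assumed from the surrounding context):
import org.apache.spark.mllib.feature.StandardScaler
// Fit a scaler that subtracts the per-column mean but leaves the
// variance untouched; fit() runs a Spark job over the data, which
// produces the timing line shown above.
val scaler = new StandardScaler(withMean = true, withStd = false).fit(vectors)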
Tip
Note that subtracting the mean is straightforward for dense input data. For sparse vectors, however, subtracting the (generally dense) mean vector from each input will turn the sparse data into dense data. For very high-dimensional input, this can easily exhaust the available memory, so it is not advisable.
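To see why, consider a toy illustration (hypothetical dimensions and values, not part of the pipeline above):
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}
// A mostly-zero vector stored sparsely: only 2 of 10,000 entries are
// non-zero, so only those values and their indices are kept in memory.
val sparse = Vectors.sparse(10000, Seq((0, 1.0), (42, 3.0)))
// Subtracting a non-zero mean makes every component non-zero, so the
// result can only be represented densely: all 10,000 doubles are stored.
val mean = 0.5
val centered = new DenseVector(sparse.toArray.map(_ - mean))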
Finally, we will use the returned scaler to transform the raw image vectors to vectors
with the column means subtracted:
val scaledVectors = vectors.map(v => scaler.transform(v))
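As a quick sanity check (an illustrative snippet, assuming the RDDs above), you can compare the first few elements of a raw vector with its scaled counterpart; each scaled value should equal the raw value minus that column's mean:
println(vectors.first.toArray.take(5).mkString(", "))
println(scaledVectors.first.toArray.take(5).mkString(", "))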
We mentioned earlier that the resized grayscale images would take up around 10 MB of memory. Indeed, you can take a look at the memory usage on the Storage page of the Spark application monitor by going to http://localhost:4040/storage/ in your web browser.
Since we gave our RDD of image vectors a friendly name of image-vectors, you should see something like the following screenshot (note that as we are using Vector[Double], each element takes up 8 bytes instead of 4 bytes; hence, we actually use 20 MB of memory):
Size of image vectors in memory
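For reference, the friendly name shown on the Storage page comes from naming (and caching) the RDD, which is typically done along these lines (a sketch, assuming the vectors RDD above):
// Name the RDD so it is identifiable on the /storage page, then cache
// it so Spark keeps the partitions in memory and reports their size.
vectors.setName("image-vectors")
vectors.cache()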