44.0,47.0,47.0,49.0,62.0,116.0,173.0,223.0,232.0,233.0, ...
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0, ...
26.0,26.0,27.0,26.0,24.0,24.0,25.0,26.0,27.0,27.0, ...
240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0, ...
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
The final step is to create an MLlib Vector instance for each image. We will cache the
RDD to speed up our later computations:
import org.apache.spark.mllib.linalg.Vectors
// Convert each array of pixel values into an MLlib dense vector
val vectors = pixels.map(p => Vectors.dense(p))
// Name the RDD so it is easy to find in the Spark web interface, and cache it
vectors.setName("image-vectors")
vectors.cache
Tip
We used the setName function in the preceding code to assign the RDD a name; in this
case, we called it image-vectors. This makes the RDD easier to identify later when
looking at the Spark web interface.
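The assigned name and caching status can also be checked programmatically; the following is a small sketch (not part of the original text) that assumes the vectors RDD defined above:
// Hypothetical sanity check on the RDD created earlier
println(vectors.name)             // prints: image-vectors
println(vectors.getStorageLevel)  // prints the storage level (MEMORY_ONLY after cache)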
Normalization
It is a common practice to standardize input data prior to running dimensionality reduction
models, in particular for PCA. As we did in Chapter 5 , Building a Classification Model
with Spark , we will do this using the built-in StandardScaler provided by MLlib's
feature package. We will only subtract the mean from the data in this case:
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = false).fit(vectors)
Calling fit triggers a computation on our RDD[Vector]. You should see output similar
to the one shown here:
...
14/09/21 11:46:58 INFO SparkContext: Job finished: reduce
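With the scaler fitted, the next step would be to apply it to each image vector. The excerpt ends before that point, so the following is only a minimal sketch of that step, assuming the vectors RDD and scaler defined above:
// Sketch: subtract the per-column mean from every image vector.
// StandardScalerModel.transform takes a Vector and returns the scaled Vector.
val scaledVectors = vectors.map(v => scaler.transform(v))
The mean-centered vectors could then be assembled into a RowMatrix for the dimensionality reduction step, which is presumably why RowMatrix is imported above.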