44.0,47.0,47.0,49.0,62.0,116.0,173.0,223.0,232.0,233.0, ...
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0, ...
26.0,26.0,27.0,26.0,24.0,24.0,25.0,26.0,27.0,27.0, ...
240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0, ...
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
The final step is to create an MLlib Vector instance for each image. We will cache the
RDD to speed up our later computations:
import org.apache.spark.mllib.linalg.Vectors
// Convert each array of pixel values into an MLlib dense vector
val vectors = pixels.map(p => Vectors.dense(p))
// Name the RDD so it is easy to find in the Spark web interface, and cache it
vectors.setName("image-vectors")
vectors.cache
Tip
We used the setName function in the preceding code to assign the RDD a name; in this
case, we called it image-vectors. This makes the RDD easier to identify later when
looking at the Spark web interface.
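The assigned name and caching status can also be checked programmatically; the following is a small sketch (not part of the original text) that assumes the vectors RDD defined above:
// Hypothetical sanity check on the RDD created earlier
println(vectors.name)             // prints: image-vectors
println(vectors.getStorageLevel)  // prints the storage level (MEMORY_ONLY after cache)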
Normalization
It is a common practice to standardize input data prior to running dimensionality reduction
models, in particular for PCA. As we did in Chapter 5 , Building a Classification Model
with Spark , we will do this using the built-in StandardScaler provided by MLlib's
feature package. We will only subtract the mean from the data in this case:
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = false).fit(vectors)
Calling fit triggers a computation on our RDD[Vector]. You should see output similar
to the one shown here:
...
14/09/21 11:46:58 INFO SparkContext: Job finished: reduce
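With the scaler fitted, the next step would be to apply it to each image vector. The excerpt ends before that point, so the following is only a minimal sketch of that step, assuming the vectors RDD and scaler defined above:
// Sketch: subtract the per-column mean from every image vector.
// StandardScalerModel.transform takes a Vector and returns the scaled Vector.
val scaledVectors = vectors.map(v => scaler.transform(v))
The mean-centered vectors could then be assembled into a RowMatrix for the dimensionality reduction step, which is presumably why RowMatrix is imported above.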