Feature standardization
Many of the models we employ make inherent assumptions about the distribution or scale of
the input data. One of the most common assumptions is that the features are normally
distributed. Let's take a closer look at the distribution of our features.
To do this, we can represent the feature vectors as a distributed matrix in MLlib, using the
RowMatrix class. A RowMatrix is backed by an RDD of vectors, where each vector is a row
of our matrix.
The RowMatrix class comes with some useful methods to operate on the matrix, one of
which is a utility to compute statistics on the columns of the matrix:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val vectors = data.map(lp => lp.features)
val matrix = new RowMatrix(vectors)
val matrixSummary = matrix.computeColumnSummaryStatistics()
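To make concrete what a per-column summary contains, here is a minimal plain-Scala sketch that computes the mean, minimum, and maximum of each column of a small local matrix. It does not use Spark and is not how MLlib implements the computation; the names ColumnStats, columnStats, and rows, as well as the sample values, are hypothetical and exist only for illustration. Like the other snippets here, it is written to be pasted into a Scala shell.

```scala
// Illustrative sketch only: per-column statistics on a local, in-memory matrix.
// ColumnStats, columnStats and rows are hypothetical names, not part of MLlib.
final case class ColumnStats(mean: Array[Double], min: Array[Double], max: Array[Double])

def columnStats(rows: Seq[Array[Double]]): ColumnStats = {
  require(rows.nonEmpty, "need at least one row")
  val numCols = rows.head.length
  require(rows.forall(_.length == numCols), "all rows must have the same length")
  // Transpose so that each inner sequence holds the values of one column.
  val columns = rows.map(_.toSeq).transpose
  ColumnStats(
    mean = columns.map(c => c.sum / c.length).toArray,
    min = columns.map(_.min).toArray,
    max = columns.map(_.max).toArray
  )
}

// Three rows (data points), three columns (features); values are made up.
val rows = Seq(
  Array(1.0, -1.0, 10.0),
  Array(3.0, 0.0, 20.0),
  Array(5.0, 4.0, 60.0)
)
val stats = columnStats(rows)
println(stats.mean.mkString("[", ",", "]")) // prints [3.0,1.0,30.0]
println(stats.min.mkString("[", ",", "]"))  // prints [1.0,-1.0,10.0]
println(stats.max.mkString("[", ",", "]"))  // prints [5.0,4.0,60.0]
```

Each result has one entry per column, which is also how the vectors returned by the summary object should be read: position i describes feature i across all rows.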
The following statement prints the mean of each column of the matrix:
println(matrixSummary.mean)
Here is the output:
[0.41225805299526636,2.761823191986623,0.46823047328614004,
...
The following statement prints the minimum value of each column of the matrix:
println(matrixSummary.min)
Here is the output:
[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,
...
The following statement prints the maximum value of each column of the matrix:
println(matrixSummary.max)