The output is as follows:
[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,
...
The following statement prints the variance of each column of the matrix:
println(matrixSummary.variance)
The output of the variance is:
[0.1097424416755897,74.30082476809638,0.04126316989120246,
...
The following statement prints the number of nonzero entries in each column of the matrix:
println(matrixSummary.numNonzeros)
Here is the output:
[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,
...
The computeColumnSummaryStatistics method computes a number of statistics
over each column of features, including the mean and variance, storing each of these in a
Vector with one entry per column (that is, one entry per feature in our case).
Looking at the preceding output for mean and variance, we can see that the second feature
has a much higher mean and variance than some of the other features (a few other features
are similar, and a few are more extreme). So, in its raw form, our data clearly does not
conform to a standard Gaussian distribution, which has zero mean and unit variance.
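To make the per-column statistics concrete, here is a minimal plain-Scala sketch (no Spark dependency) that computes the mean, variance, and nonzero count of each column. The object name `ColumnSummarySketch`, the `summarize` helper, and the three-row dataset are hypothetical and exist only for illustration. The unbiased (n - 1) variance estimator is an assumption of this sketch, so check the MLlib documentation before comparing numbers against `matrixSummary`.

```scala
// Plain-Scala sketch of per-column summary statistics (illustrative only).
object ColumnSummarySketch {
  // Returns (mean, variance, numNonzeros), each with one entry per column.
  // Variance uses the unbiased (n - 1) estimator; this is an assumption.
  def summarize(rows: Seq[Array[Double]]): (Array[Double], Array[Double], Array[Double]) = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.length
    val cols = rows.head.length
    val mean = Array.tabulate(cols)(j => rows.map(_(j)).sum / n)
    val variance = Array.tabulate(cols) { j =>
      if (n > 1) rows.map(r => math.pow(r(j) - mean(j), 2)).sum / (n - 1) else 0.0
    }
    val numNonzeros = Array.tabulate(cols)(j => rows.count(_(j) != 0.0).toDouble)
    (mean, variance, numNonzeros)
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: the second column is on a much larger scale than the
    // first, and the third column is always zero.
    val rows = Seq(
      Array(1.0, 100.0, 0.0),
      Array(2.0, 300.0, 0.0),
      Array(3.0, 200.0, 0.0)
    )
    val (mean, variance, numNonzeros) = summarize(rows)
    println(mean.mkString("[", ",", "]"))
    println(variance.mkString("[", ",", "]"))
    println(numNonzeros.mkString("[", ",", "]"))
  }
}
```

Even in this toy dataset, the second column dominates the first in both mean and variance, which is the same kind of scale mismatch visible in the real output.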
To get the data into a more suitable form for our models, we can standardize each feature
so that it has zero mean and unit standard deviation. We do this by subtracting the
column mean from each feature value and then dividing the result by the column
standard deviation for the feature:
(x - μ) / sqrt(variance)
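The formula above can be sketched in plain Scala as an element-wise operation on a single feature vector. The object name `StandardizeSketch`, the `standardize` helper, and the sample values are hypothetical. Mapping zero-variance columns to 0.0 is an assumption made here to avoid dividing by zero, not something the text prescribes.

```scala
// Plain-Scala sketch of feature standardization (illustrative only).
object StandardizeSketch {
  // Applies (x - mean) / sqrt(variance) to each element of the feature vector.
  // Columns with zero variance are mapped to 0.0 instead of dividing by zero;
  // that choice is an assumption of this sketch.
  def standardize(x: Array[Double], mean: Array[Double], variance: Array[Double]): Array[Double] = {
    require(x.length == mean.length && x.length == variance.length, "dimension mismatch")
    Array.tabulate(x.length) { j =>
      val std = math.sqrt(variance(j))
      if (std == 0.0) 0.0 else (x(j) - mean(j)) / std
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up column statistics and one made-up feature vector.
    val mean = Array(2.0, 200.0, 0.0)
    val variance = Array(1.0, 10000.0, 0.0)
    val scaled = standardize(Array(3.0, 100.0, 0.0), mean, variance)
    println(scaled.mkString("[", ",", "]"))
  }
}
```

The zero-variance guard matters for data like ours: the nonzero counts shown earlier include a column with no nonzero entries, so its variance is zero and a naive division would produce NaN.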
Practically, for each feature vector in our input dataset, we can simply perform an
element-wise subtraction of the preceding mean vector from the feature vector and then
perform an element-wise division of the feature vector by the vector of feature standard