Building a Classification Model with Spark - Machine Learning with Spark

The output is as follows:

[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,

...

The following statement prints the variance of each column of the matrix:

println(matrixSummary.variance)

The output of the variance is:

[0.1097424416755897,74.30082476809638,0.04126316989120246,

...

The following statement prints the number of nonzero entries in each column of the matrix:

println(matrixSummary.numNonzeros)

Here is the output:

[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,

...

The computeColumnSummaryStatistics method computes a number of statistics over each column of features, including the mean and variance, storing each of these in a Vector with one entry per column (that is, one entry per feature in our case).
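To make the idea of one entry per column concrete, here is a minimal plain-Scala sketch that computes the same kinds of statistics without Spark. The toy data and the names ColumnSummarySketch, rows, and columns are hypothetical and used only for illustration. The sketch divides by n - 1 when computing the variance (the sample variance); treat that choice as an assumption of the sketch rather than a statement about the library's internals.

```scala
object ColumnSummarySketch {
  // Hypothetical toy data: 3 rows (data points) x 3 columns (features).
  val rows: Seq[Array[Double]] = Seq(
    Array(1.0, 10.0, 0.0),
    Array(2.0, 20.0, 0.0),
    Array(3.0, 60.0, 1.0)
  )

  val n: Int = rows.length
  val numCols: Int = rows.head.length

  // Regroup the values so that each inner sequence holds one column.
  val columns: Seq[Seq[Double]] = (0 until numCols).map(j => rows.map(_(j)))

  // One mean per column.
  val mean: Seq[Double] = columns.map(_.sum / n)

  // One sample variance per column (assumption: divide by n - 1).
  val variance: Seq[Double] = columns.zip(mean).map { case (col, m) =>
    col.map(x => (x - m) * (x - m)).sum / (n - 1)
  }

  // One nonzero count per column, stored as Double to mirror the output above.
  val numNonzeros: Seq[Double] = columns.map(_.count(_ != 0.0).toDouble)

  def main(args: Array[String]): Unit = {
    println(mean.mkString("[", ",", "]"))
    println(variance.mkString("[", ",", "]"))
    println(numNonzeros.mkString("[", ",", "]"))
  }
}
```

Each result has exactly as many entries as there are feature columns, which is the same shape as the vectors printed by matrixSummary earlier.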

Looking at the preceding output for mean and variance, we can see quite clearly that the second feature has a much higher mean and variance than some of the other features (you will find a few other features that are similar and a few others that are more extreme). So, our data definitely does not conform to a standard Gaussian distribution in its raw form.

To get the data in a more suitable form for our models, we can standardize each feature such that it has zero mean and unit standard deviation. We can do this by subtracting the column mean from each feature value and then scaling it by dividing it by the column standard deviation for the feature:

(x - μ) / sqrt(variance)
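As a worked illustration of this formula, the following plain-Scala sketch standardizes a small, hypothetical dataset without using Spark. The names StandardizeSketch, rows, and standardize are made up for this example. Two choices are assumptions of the sketch: the variance divides by n - 1, and a column with zero variance is mapped to 0.0 to avoid dividing by zero (the nonzero counts above show that at least one such all-zero column exists in our data).

```scala
object StandardizeSketch {
  // Hypothetical toy data: 3 rows (data points) x 2 columns (features).
  val rows: Seq[Array[Double]] = Seq(
    Array(1.0, 10.0),
    Array(2.0, 20.0),
    Array(3.0, 60.0)
  )

  val n: Int = rows.length
  val numCols: Int = rows.head.length

  // Column means (the mean vector).
  val mean: Array[Double] =
    Array.tabulate(numCols)(j => rows.map(_(j)).sum / n)

  // Column standard deviations: sqrt of the sample variance (n - 1).
  val stdDev: Array[Double] = Array.tabulate(numCols) { j =>
    val sumSq = rows.map(r => (r(j) - mean(j)) * (r(j) - mean(j))).sum
    math.sqrt(sumSq / (n - 1))
  }

  // Apply (x - mean) / sqrt(variance) to every element of one feature vector.
  // Assumption: a zero-variance column is mapped to 0.0.
  def standardize(x: Array[Double]): Array[Double] =
    Array.tabulate(x.length) { j =>
      if (stdDev(j) == 0.0) 0.0 else (x(j) - mean(j)) / stdDev(j)
    }

  val scaled: Seq[Array[Double]] = rows.map(standardize)

  def main(args: Array[String]): Unit =
    scaled.foreach(r => println(r.mkString("[", ",", "]")))
}
```

After this transformation each column of scaled has a mean of zero and a sample standard deviation of one, up to floating-point rounding.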

Practically, for each feature vector in our input dataset, we can simply perform an element-wise subtraction of the preceding mean vector from the feature vector and then perform an element-wise division of the feature vector by the vector of feature standard
