deviations. The standard deviation vector itself can be obtained by performing an element-wise square root operation on the variance vector.
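The element-wise square root step can be sketched in plain Scala; the variance values here are made up for illustration:

```scala
// Turning a variance vector into a standard deviation vector by
// taking the square root of each element (illustrative values).
val variances = Array(4.0, 9.0, 0.25)
val stdDevs = variances.map(math.sqrt)
println(stdDevs.mkString(", ")) // prints "2.0, 3.0, 0.5"
```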
Fortunately, Spark provides a convenience class, StandardScaler, to accomplish this.
StandardScaler works in much the same way as the Normalizer feature we used in that chapter. We will instantiate it by passing in two arguments that tell it whether to subtract the mean from the data and whether to apply standard deviation scaling. We will then fit StandardScaler on our input vectors. Finally, we will pass an input vector to the transform function, which will return a normalized vector. We will do this within the following map function to preserve the label from our dataset:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
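The arithmetic StandardScaler performs can be made concrete with a plain-Scala sketch that needs no Spark cluster. The sample rows below are made up, and the sketch assumes the sample (n - 1) standard deviation, as MLlib's summary statistics use:

```scala
// Column-wise standardization: (x - mean) / stdDev.
// A plain-Scala sketch of what StandardScaler(withMean = true,
// withStd = true) computes; the data is illustrative only.
val rows = Array(Array(1.0, 10.0), Array(3.0, 30.0), Array(5.0, 50.0))
val n = rows.length.toDouble

val means = rows.transpose.map(col => col.sum / n)
val stds = rows.transpose.map { col =>
  val m = col.sum / n
  // Sample standard deviation (divide by n - 1)
  math.sqrt(col.map(x => (x - m) * (x - m)).sum / (n - 1))
}

val standardized = rows.map(_.zipWithIndex.map {
  case (x, j) => (x - means(j)) / stds(j)
})
standardized.foreach(r => println(r.mkString(", ")))
```

Each column now has zero mean and unit standard deviation, which is exactly the property we want before feeding the features to models that are sensitive to feature scale.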
Our data should now be standardized. Let's inspect the first row of the original and standardized features:
println(data.first.features)
The output of the preceding line of code is as follows:
[0.789131,2.055555556,0.676470588,0.205882353,
The following code will print the first row of the standardized features:
println(scaledData.first.features)
The output is as follows:
[1.1376439023494747,-0.08193556218743517,1.025134766284205,-0.0558631837375738,
As we can see, the first feature has been transformed by applying the standardization formula. We can check this by subtracting the mean (which we computed earlier) from the