deviations. The standard deviation vector itself can be obtained by performing an element-
wise square root operation on the variance vector.
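For example, a minimal sketch of how this could be done with MLlib's Statistics.colStats, assuming vectors is the RDD[Vector] of our feature vectors, might look like the following:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// Assumes `vectors` is an RDD[Vector] of feature vectors
val summary = Statistics.colStats(vectors)
// The element-wise square root of the variance vector gives the standard deviations
val stdDev = Vectors.dense(summary.variance.toArray.map(math.sqrt))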
As we mentioned in Chapter 3, Obtaining, Processing, and Preparing Data with Spark, we fortunately have access to Spark's StandardScaler, which provides a convenient way to accomplish this.
StandardScaler works in much the same way as the Normalizer feature we used
in that chapter. We will instantiate it by passing in two arguments that tell it whether to
subtract the mean from the data and whether to apply standard deviation scaling. We will
then fit StandardScaler on our input vectors. Finally, we will pass in an input
vector to the transform function, which will then return a normalized vector. We will
do this within the following map function to preserve the label from our dataset:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Fit the scaler: subtract the mean (withMean) and scale to unit standard deviation (withStd)
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
// Scale each feature vector while preserving the original label
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
Our data should now be standardized. Let's inspect the first row of the original and standardized features:
println(data.first.features)
The output of the preceding line of code is as follows:
[0.789131,2.055555556,0.676470588,0.205882353, ...
The following code will print the first row of the standardized features:
println(scaledData.first.features)
The output is as follows:
[1.1376439023494747,-0.08193556218743517,1.025134766284205,-0.0558631837375738, ...
As we can see, the first feature has been transformed by applying the standardization formula. We can check this by subtracting the mean (which we computed earlier) from the first value of the original feature and dividing the result by the square root of the variance, that is, the standard deviation.
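As a rough sanity check, and assuming summary holds the column statistics computed earlier with Statistics.colStats, a sketch of this calculation might be:
// Hypothetical check for the first feature, using the column statistics in `summary`
val firstRaw = data.first.features(0)
val standardized = (firstRaw - summary.mean(0)) / math.sqrt(summary.variance(0))
println(standardized)  // should be close to 1.1376..., matching the scaled output above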