deviations. The standard deviation vector itself can be obtained by performing an element-wise square root operation on the variance vector.
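The element-wise square root step can be sketched in plain Scala; the variance values here are made up for illustration:

```scala
// Turning a variance vector into a standard deviation vector by
// taking the square root of each element (illustrative values).
val variances = Array(4.0, 9.0, 0.25)
val stdDevs = variances.map(math.sqrt)
println(stdDevs.mkString(", ")) // prints "2.0, 3.0, 0.5"
```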
Fortunately, Spark provides a convenience class, StandardScaler, to accomplish this.
StandardScaler works in much the same way as the Normalizer feature we used in that chapter. We will instantiate it by passing in two arguments that tell it whether to subtract the mean from the data and whether to apply standard deviation scaling. We will then fit StandardScaler on our input vectors. Finally, we will pass an input vector to the transform function, which will return a normalized vector. We will do this within the following map function to preserve the label from our dataset:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
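The arithmetic StandardScaler performs can be made concrete with a plain-Scala sketch that needs no Spark cluster. The sample rows below are made up, and the sketch assumes the sample (n - 1) standard deviation, as MLlib's summary statistics use:

```scala
// Column-wise standardization: (x - mean) / stdDev.
// A plain-Scala sketch of what StandardScaler(withMean = true,
// withStd = true) computes; the data is illustrative only.
val rows = Array(Array(1.0, 10.0), Array(3.0, 30.0), Array(5.0, 50.0))
val n = rows.length.toDouble

val means = rows.transpose.map(col => col.sum / n)
val stds = rows.transpose.map { col =>
  val m = col.sum / n
  // Sample standard deviation (divide by n - 1)
  math.sqrt(col.map(x => (x - m) * (x - m)).sum / (n - 1))
}

val standardized = rows.map(_.zipWithIndex.map {
  case (x, j) => (x - means(j)) / stds(j)
})
standardized.foreach(r => println(r.mkString(", ")))
```

Each column now has zero mean and unit standard deviation, which is exactly the property we want before feeding the features to models that are sensitive to feature scale.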
Our data should now be standardized. Let's inspect the first row of the original and standardized features:
println(data.first.features)
The output of the preceding line of code is as follows:
[0.789131,2.055555556,0.676470588,0.205882353,
The following code will print the first row of the standardized features:
println(scaledData.first.features)
The output is as follows:
[1.1376439023494747,-0.08193556218743517,1.025134766284205,-0.0558631837375738,
As we can see, the first feature has been transformed by applying the standardization formula. We can check this by subtracting the mean (which we computed earlier) from the