Using MLlib for feature normalization
Spark provides some built-in functions for feature scaling and standardization in its MLlib machine learning library. These include StandardScaler, which applies the standard normal transformation, and Normalizer, which applies the same feature vector normalization we showed you in our preceding example code.
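To make the distinction between the two transforms concrete, here is a minimal numpy sketch (not MLlib code; the matrix X is an arbitrary example) contrasting StandardScaler-style standardization, which operates per feature across rows, with Normalizer-style normalization, which operates per row vector:

```python
import numpy as np

# Illustrative data: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

# StandardScaler-style: per-feature zero mean and unit variance
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizer-style: scale each row vector to unit L2 norm
norms = np.linalg.norm(X, axis=1, keepdims=True)
normalized = X / norms
```

After this, each column of `standardized` has mean 0 and standard deviation 1, while each row of `normalized` has L2 norm 1.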
We will explore the use of these methods in the upcoming chapters, but for now, let's
simply compare the results of using MLlib's Normalizer to our own results:
from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector = sc.parallelize([x])
After importing the required class, we will instantiate Normalizer (by default, it will use the L2 norm, as we did earlier). Note that, as in most situations in Spark, we need to provide Normalizer with an RDD as input (containing numpy arrays or MLlib vectors); hence, we will create a single-element RDD from our vector x for illustrative purposes.
We will then use the transform function of Normalizer on our RDD. Since the RDD has only one vector in it, we will return that vector to the driver by calling first, and then convert it back into a numpy array by calling toArray:
normalized_x_mllib = normalizer.transform(vector).first().toArray()
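The chain of calls can be mimicked locally without a Spark cluster. In the sketch below, MockVector and the plain list standing in for the RDD are hypothetical stand-ins, not Spark or MLlib API; the point is only to show what each step of transform, first, and toArray produces:

```python
import numpy as np

# Hypothetical stand-in for an MLlib vector with a toArray method
class MockVector:
    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)
    def toArray(self):
        return self.values

# A single-element list plays the role of the single-element RDD
rdd = [MockVector([3.0, 4.0])]

# "transform": L2-normalize each vector in the collection
transformed = [MockVector(v.values / np.linalg.norm(v.values)) for v in rdd]

# "first().toArray()": take the only element and convert it to a numpy array
result = transformed[0].toArray()
```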
Finally, we can print out the same details as we did previously, comparing the results:
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)
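If Spark is not at hand, the same comparison can be reproduced in pure numpy as a rough sanity check. The vector x below is an arbitrary example, not the one generated earlier in the chapter, and the normalization shown is the plain L2 scaling that Normalizer is documented to apply by default:

```python
import numpy as np

# Arbitrary example vector (stand-in for the x used in the chapter)
x = np.array([0.5, -0.1, 0.6, 1.5, -0.2])

# L2 norm of x, then scale x to unit length
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2

print("2-Norm of x: %2.4f" % norm_x_2)
print("2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x))
```

Regardless of the input vector, the norm of the normalized result should come out as 1.0000.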