0.20941776]
2-Norm of normalized_x: 1.0000
Using MLlib for feature normalization
Spark provides some built-in functions for feature scaling and standardization in its MLlib machine learning library. These include StandardScaler, which applies the standard normal transformation, and Normalizer, which applies the same feature vector normalization we showed you in our preceding example code.
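To make the distinction concrete, the two transformations can be sketched in plain NumPy: standardization operates per feature (column), while normalization operates per feature vector (row). This is a rough local sketch of the math only, using made-up data; MLlib's actual defaults differ in details (for example, StandardScaler can use the sample standard deviation and does not mean-center by default):

```python
import numpy as np

# Toy data: rows are observations, columns are features (made-up values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Per-column standardization (what StandardScaler computes, roughly):
# subtract the column mean and divide by the column standard deviation,
# yielding zero-mean, unit-variance features.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Per-row normalization (what Normalizer computes with the default L2 norm):
# divide each feature vector by its L2 norm, yielding unit-length vectors.
norms = np.linalg.norm(X, axis=1, keepdims=True)
normalized = X / norms
```

After this, every row of `normalized` has a 2-norm of 1, and every column of `standardized` has mean 0 and variance 1.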
We will explore the use of these methods in the upcoming chapters, but for now, let's simply compare the results of using MLlib's Normalizer to our own results:
from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector = sc.parallelize([x])
After importing the required class, we will instantiate Normalizer (by default, it will use the L2 norm, as we did earlier). Note that, as in most situations in Spark, we need to provide Normalizer with an RDD as input (containing numpy arrays or MLlib vectors); hence, we will create a single-element RDD from our vector x for illustrative purposes.
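Since running the pyspark example requires a live SparkContext, the shape of this step can be mimicked locally; here is a minimal sketch that uses a plain Python list as a stand-in for the single-element RDD, with a made-up vector in place of x:

```python
import numpy as np

# Made-up vector standing in for x from the book's running example.
x = np.array([0.5, -1.5, 2.0])

# A plain list stands in for the single-element RDD created by
# sc.parallelize([x]).
vector = [x]

# What Normalizer.transform does to each element of the RDD:
# divide the vector by its L2 norm.
normalized = [v / np.linalg.norm(v) for v in vector]
```

The mapping over the list mirrors how transform maps over every vector in the RDD; with a real SparkContext, the computation is distributed but element-wise identical.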
We will then use the transform function of Normalizer on our RDD. Since the RDD contains only one vector, we will return our vector to the driver by calling first, and finally call the toArray function to convert the vector back into a numpy array:
normalized_x_mllib = normalizer.transform(vector).first().toArray()
Finally, we can print out the same details as we did previously, comparing the results:
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)
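Even without a Spark cluster at hand, the comparison itself can be sanity-checked locally: normalizing by hand and checking the resulting 2-norm should give exactly the property the printed output reports. A small NumPy sketch, using a made-up vector in place of x:

```python
import numpy as np

# Made-up vector standing in for x; any nonzero vector works here.
x = np.array([0.49671415, -0.1382643, 0.64768854, 1.52302986, -0.23415337])

# Manual L2 normalization, mirroring both our earlier code and Normalizer.
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2

# The normalized vector should have unit 2-norm, matching the "1.0000"
# reported for both the manual and the MLlib result.
print("2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x))
```

If the MLlib result agrees with the manual one, np.linalg.norm of both normalized vectors will print as 1.0000.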