0.20941776]
2-Norm of normalized_x: 1.0000
Using MLlib for feature normalization
Spark provides some built-in functions for feature scaling and standardization in its MLlib machine learning library. These include StandardScaler, which applies the standard normal transformation, and Normalizer, which applies the same feature vector normalization we showed you in our preceding example code.
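To make the distinction concrete, the two transformations can be sketched in plain NumPy: standardization operates per feature (column), while normalization operates per feature vector (row). This is a rough local sketch of the math only, using made-up data; MLlib's actual defaults differ in details (for example, StandardScaler can use the sample standard deviation and does not mean-center by default):

```python
import numpy as np

# Toy data: rows are observations, columns are features (made-up values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Per-column standardization (what StandardScaler computes, roughly):
# subtract the column mean and divide by the column standard deviation,
# yielding zero-mean, unit-variance features.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Per-row normalization (what Normalizer computes with the default L2 norm):
# divide each feature vector by its L2 norm, yielding unit-length vectors.
norms = np.linalg.norm(X, axis=1, keepdims=True)
normalized = X / norms
```

After this, every row of `normalized` has a 2-norm of 1, and every column of `standardized` has mean 0 and variance 1.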
We will explore the use of these methods in the upcoming chapters, but for now, let's simply compare the results of using MLlib's Normalizer to our own results:
from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector = sc.parallelize([x])
After importing the required class, we will instantiate Normalizer (by default, it will use the L2 norm, as we did earlier). Note that, as in most situations in Spark, we need to provide Normalizer with an RDD as input (containing numpy arrays or MLlib vectors); hence, we will create a single-element RDD from our vector x for illustrative purposes.
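Since running the pyspark example requires a live SparkContext, the shape of this step can be mimicked locally; here is a minimal sketch that uses a plain Python list as a stand-in for the single-element RDD, with a made-up vector in place of x:

```python
import numpy as np

# Made-up vector standing in for x from the book's running example.
x = np.array([0.5, -1.5, 2.0])

# A plain list stands in for the single-element RDD created by
# sc.parallelize([x]).
vector = [x]

# What Normalizer.transform does to each element of the RDD:
# divide the vector by its L2 norm.
normalized = [v / np.linalg.norm(v) for v in vector]
```

The mapping over the list mirrors how transform maps over every vector in the RDD; with a real SparkContext, the computation is distributed but element-wise identical.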
We will then use the transform function of Normalizer on our RDD. Since the RDD contains only one vector, we will return our vector to the driver by calling first, and finally call the toArray function to convert the vector back into a numpy array:
normalized_x_mllib = normalizer.transform(vector).first().toArray()
Finally, we can print out the same details as we did previously, comparing the results:
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)
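Even without a Spark cluster at hand, the comparison itself can be sanity-checked locally: normalizing by hand and checking the resulting 2-norm should give exactly the property the printed output reports. A small NumPy sketch, using a made-up vector in place of x:

```python
import numpy as np

# Made-up vector standing in for x; any nonzero vector works here.
x = np.array([0.49671415, -0.1382643, 0.64768854, 1.52302986, -0.23415337])

# Manual L2 normalization, mirroring both our earlier code and Normalizer.
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2

# The normalized vector should have unit 2-norm, matching the "1.0000"
# reported for both the manual and the MLlib result.
print("2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x))
```

If the MLlib result agrees with the manual one, np.linalg.norm of both normalized vectors will print as 1.0000.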