feature vector is correct (and is somewhat more efficient than, say, creating many smaller binary vectors and concatenating them). The numeric vector is created directly by first converting the data to floating-point numbers and wrapping these in a numpy array. The resulting two vectors are then concatenated. The extract_label function simply converts the last column variable (the count) into a float.
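For reference, the two utility functions described above might look something like the following sketch. Here, mappings (the list of category-to-index dictionaries), cat_len (the combined length of all the binary blocks), and the exact field positions are assumptions based on the record layout shown in the output further below:
import numpy as np

def extract_features(record):
    # Fill one binary vector for all categorical fields at once;
    # 'mappings' (a list of category-to-index dicts) and 'cat_len'
    # (the total length of the binary block) are assumed to have been
    # built earlier from the data
    cat_vec = np.zeros(cat_len)
    step = 0
    for i, field in enumerate(record[2:10]):
        m = mappings[i]
        cat_vec[step + m[field]] = 1
        step = step + len(m)
    # Convert the real-valued fields directly to floats, wrap them in
    # a numpy array, and join the two parts
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # The count is the last column in each record
    return float(record[-1])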
With our utility functions defined, we can proceed with extracting feature vectors and labels from our data records:
from pyspark.mllib.regression import LabeledPoint

data = records.map(lambda r: LabeledPoint(extract_label(r),
  extract_features(r)))
Let's inspect the first record in the extracted feature RDD:
first = records.first()  # the first raw record, for comparison
first_point = data.first()
print "Raw data: " + str(first[2:])
print "Label: " + str(first_point.label)
print "Linear Model feature vector:\n" + str(first_point.features)
print "Linear Model feature vector length: " + str(len(first_point.features))
You should see output similar to the following:
Raw data: [u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1',
u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
Label: 16.0
Linear Model feature vector:
[1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]
Linear Model feature vector length: 61
As we can see, we converted the raw data into a feature vector made up of the binary categorical and real numeric features, and we indeed have a total vector length of 61.
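As a quick sanity check, assuming the standard bike sharing fields, the eight categorical variables (season, yr, mnth, hr, holiday, weekday, workingday, and weathersit) expand to 4 + 2 + 12 + 24 + 2 + 7 + 2 + 4 = 57 binary entries, and adding the 4 numeric features gives the 61 we see here.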
Creating feature vectors for the decision tree
As we have seen, decision tree models typically work on raw features (that is, it is not required to convert categorical features into a binary vector encoding; they can, instead, be used directly). Therefore, we will create a separate function to extract the decision tree feature vector.
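A minimal sketch of such a function, using the same field positions as before (the name extract_features_dt and the assumption that every feature field parses as a float are ours):
def extract_features_dt(record):
    # No binary encoding needed: the tree can split on the raw values,
    # so simply convert each feature field to a float
    return np.array([float(field) for field in record[2:14]])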